Message ID | 20190523105440.27045-1-rkagan@virtuozzo.com (mailing list archive) |
---|---|
Headers | show |
Series | establish nesting rule of BQL vs cpu-exclusive | expand |
Roman Kagan <rkagan@virtuozzo.com> writes: > I came across the following AB-BA deadlock: > > vCPU thread main thread > ----------- ----------- > async_safe_run_on_cpu(self, > async_synic_update) > ... [cpu hot-add] > process_queued_cpu_work() > qemu_mutex_unlock_iothread() > [grab BQL] > start_exclusive() cpu_list_add() > async_synic_update() finish_safe_work() > qemu_mutex_lock_iothread() cpu_exec_start() > > ATM async_synic_update seems to be the only async safe work item that > grabs BQL. However it isn't quite obvious that it shouldn't; in the > past there were more examples of this (e.g. > memory_region_do_invalidate_mmio_ptr). > > It looks like the problem is generally in the lack of the nesting rule > for cpu-exclusive sections against BQL, so I thought I would try to > address that. This patchset is my feeble attempt at this; I'm not sure > I fully comprehend all the consequences (rather, I'm sure I don't) hence > RFC. Hmm I think this is an area touched by: Subject: [PATCH v7 00/73] per-CPU locks Date: Mon, 4 Mar 2019 13:17:00 -0500 Message-Id: <20190304181813.8075-1-cota@braap.org> which has stalled on it's path into the tree. Last time I checked it explicitly handled the concept of work that needed the BQL and work that didn't. How do you trigger your deadlock? Just hot-pluging CPUs? > > Roman Kagan (2): > cpus-common: nuke finish_safe_work > cpus-common: assert BQL nesting within cpu-exclusive sections > > cpus-common.c | 12 ++++-------- > 1 file changed, 4 insertions(+), 8 deletions(-) -- Alex Bennée
On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote: > > Roman Kagan <rkagan@virtuozzo.com> writes: > > > I came across the following AB-BA deadlock: > > > > vCPU thread main thread > > ----------- ----------- > > async_safe_run_on_cpu(self, > > async_synic_update) > > ... [cpu hot-add] > > process_queued_cpu_work() > > qemu_mutex_unlock_iothread() > > [grab BQL] > > start_exclusive() cpu_list_add() > > async_synic_update() finish_safe_work() > > qemu_mutex_lock_iothread() cpu_exec_start() > > > > ATM async_synic_update seems to be the only async safe work item that > > grabs BQL. However it isn't quite obvious that it shouldn't; in the > > past there were more examples of this (e.g. > > memory_region_do_invalidate_mmio_ptr). > > > > It looks like the problem is generally in the lack of the nesting rule > > for cpu-exclusive sections against BQL, so I thought I would try to > > address that. This patchset is my feeble attempt at this; I'm not sure > > I fully comprehend all the consequences (rather, I'm sure I don't) hence > > RFC. > > Hmm I think this is an area touched by: > > Subject: [PATCH v7 00/73] per-CPU locks > Date: Mon, 4 Mar 2019 13:17:00 -0500 > Message-Id: <20190304181813.8075-1-cota@braap.org> > > which has stalled on it's path into the tree. Last time I checked it > explicitly handled the concept of work that needed the BQL and work that > didn't. I'm still trying to get my head around that patchset, but it looks like it changes nothing in regards to cpu-exclusive sections and safe work, so it doesn't make the problem go. > How do you trigger your deadlock? Just hot-pluging CPUs? Yes. The window is pretty narrow so I only saw it once although this test (where the vms are started and stopped and the cpus are plugged in and out) is in our test loop for quite a bit (probably 2+ years). Roman.
On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote: > On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote: > > > > Roman Kagan <rkagan@virtuozzo.com> writes: > > > > > I came across the following AB-BA deadlock: > > > > > > vCPU thread main thread > > > ----------- ----------- > > > async_safe_run_on_cpu(self, > > > async_synic_update) > > > ... [cpu hot-add] > > > process_queued_cpu_work() > > > qemu_mutex_unlock_iothread() > > > [grab BQL] > > > start_exclusive() cpu_list_add() > > > async_synic_update() finish_safe_work() > > > qemu_mutex_lock_iothread() cpu_exec_start() > > > > > > ATM async_synic_update seems to be the only async safe work item that > > > grabs BQL. However it isn't quite obvious that it shouldn't; in the > > > past there were more examples of this (e.g. > > > memory_region_do_invalidate_mmio_ptr). > > > > > > It looks like the problem is generally in the lack of the nesting rule > > > for cpu-exclusive sections against BQL, so I thought I would try to > > > address that. This patchset is my feeble attempt at this; I'm not sure > > > I fully comprehend all the consequences (rather, I'm sure I don't) hence > > > RFC. > > > > Hmm I think this is an area touched by: > > > > Subject: [PATCH v7 00/73] per-CPU locks > > Date: Mon, 4 Mar 2019 13:17:00 -0500 > > Message-Id: <20190304181813.8075-1-cota@braap.org> > > > > which has stalled on it's path into the tree. Last time I checked it > > explicitly handled the concept of work that needed the BQL and work that > > didn't. > > I'm still trying to get my head around that patchset, but it looks like > it changes nothing in regards to cpu-exclusive sections and safe work, > so it doesn't make the problem go. > > > How do you trigger your deadlock? Just hot-pluging CPUs? > > Yes. The window is pretty narrow so I only saw it once although this > test (where the vms are started and stopped and the cpus are plugged in > and out) is in our test loop for quite a bit (probably 2+ years). > > Roman. ping?
On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote: > On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote: > > On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote: > > > > > > Roman Kagan <rkagan@virtuozzo.com> writes: > > > > > > > I came across the following AB-BA deadlock: > > > > > > > > vCPU thread main thread > > > > ----------- ----------- > > > > async_safe_run_on_cpu(self, > > > > async_synic_update) > > > > ... [cpu hot-add] > > > > process_queued_cpu_work() > > > > qemu_mutex_unlock_iothread() > > > > [grab BQL] > > > > start_exclusive() cpu_list_add() > > > > async_synic_update() finish_safe_work() > > > > qemu_mutex_lock_iothread() cpu_exec_start() > > > > > > > > ATM async_synic_update seems to be the only async safe work item that > > > > grabs BQL. However it isn't quite obvious that it shouldn't; in the > > > > past there were more examples of this (e.g. > > > > memory_region_do_invalidate_mmio_ptr). > > > > > > > > It looks like the problem is generally in the lack of the nesting rule > > > > for cpu-exclusive sections against BQL, so I thought I would try to > > > > address that. This patchset is my feeble attempt at this; I'm not sure > > > > I fully comprehend all the consequences (rather, I'm sure I don't) hence > > > > RFC. > > > > > > Hmm I think this is an area touched by: > > > > > > Subject: [PATCH v7 00/73] per-CPU locks > > > Date: Mon, 4 Mar 2019 13:17:00 -0500 > > > Message-Id: <20190304181813.8075-1-cota@braap.org> > > > > > > which has stalled on it's path into the tree. Last time I checked it > > > explicitly handled the concept of work that needed the BQL and work that > > > didn't. > > > > I'm still trying to get my head around that patchset, but it looks like > > it changes nothing in regards to cpu-exclusive sections and safe work, > > so it doesn't make the problem go. > > > > > How do you trigger your deadlock? Just hot-pluging CPUs? > > > > Yes. The window is pretty narrow so I only saw it once although this > > test (where the vms are started and stopped and the cpus are plugged in > > and out) is in our test loop for quite a bit (probably 2+ years). > > > > Roman. > > ping? ping?
On Fri, Jun 21, 2019 at 12:49:07PM +0000, Roman Kagan wrote: > On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote: > > On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote: > > > On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote: > > > > > > > > Roman Kagan <rkagan@virtuozzo.com> writes: > > > > > > > > > I came across the following AB-BA deadlock: > > > > > > > > > > vCPU thread main thread > > > > > ----------- ----------- > > > > > async_safe_run_on_cpu(self, > > > > > async_synic_update) > > > > > ... [cpu hot-add] > > > > > process_queued_cpu_work() > > > > > qemu_mutex_unlock_iothread() > > > > > [grab BQL] > > > > > start_exclusive() cpu_list_add() > > > > > async_synic_update() finish_safe_work() > > > > > qemu_mutex_lock_iothread() cpu_exec_start() > > > > > > > > > > ATM async_synic_update seems to be the only async safe work item that > > > > > grabs BQL. However it isn't quite obvious that it shouldn't; in the > > > > > past there were more examples of this (e.g. > > > > > memory_region_do_invalidate_mmio_ptr). > > > > > > > > > > It looks like the problem is generally in the lack of the nesting rule > > > > > for cpu-exclusive sections against BQL, so I thought I would try to > > > > > address that. This patchset is my feeble attempt at this; I'm not sure > > > > > I fully comprehend all the consequences (rather, I'm sure I don't) hence > > > > > RFC. > > > > > > > > Hmm I think this is an area touched by: > > > > > > > > Subject: [PATCH v7 00/73] per-CPU locks > > > > Date: Mon, 4 Mar 2019 13:17:00 -0500 > > > > Message-Id: <20190304181813.8075-1-cota@braap.org> > > > > > > > > which has stalled on it's path into the tree. Last time I checked it > > > > explicitly handled the concept of work that needed the BQL and work that > > > > didn't. > > > > > > I'm still trying to get my head around that patchset, but it looks like > > > it changes nothing in regards to cpu-exclusive sections and safe work, > > > so it doesn't make the problem go. > > > > > > > How do you trigger your deadlock? Just hot-pluging CPUs? > > > > > > Yes. The window is pretty narrow so I only saw it once although this > > > test (where the vms are started and stopped and the cpus are plugged in > > > and out) is in our test loop for quite a bit (probably 2+ years). > > > > > > Roman. > > > > ping? > > ping? ping?
On 05/08/19 14:47, Roman Kagan wrote: > On Fri, Jun 21, 2019 at 12:49:07PM +0000, Roman Kagan wrote: >> On Thu, Jun 06, 2019 at 01:22:33PM +0000, Roman Kagan wrote: >>> On Mon, May 27, 2019 at 11:05:38AM +0000, Roman Kagan wrote: >>>> On Thu, May 23, 2019 at 12:31:16PM +0100, Alex Bennée wrote: >>>>> >>>>> Roman Kagan <rkagan@virtuozzo.com> writes: >>>>> >>>>>> I came across the following AB-BA deadlock: >>>>>> >>>>>> vCPU thread main thread >>>>>> ----------- ----------- >>>>>> async_safe_run_on_cpu(self, >>>>>> async_synic_update) >>>>>> ... [cpu hot-add] >>>>>> process_queued_cpu_work() >>>>>> qemu_mutex_unlock_iothread() >>>>>> [grab BQL] >>>>>> start_exclusive() cpu_list_add() >>>>>> async_synic_update() finish_safe_work() >>>>>> qemu_mutex_lock_iothread() cpu_exec_start() >>>>>> >>>>>> ATM async_synic_update seems to be the only async safe work item that >>>>>> grabs BQL. However it isn't quite obvious that it shouldn't; in the >>>>>> past there were more examples of this (e.g. >>>>>> memory_region_do_invalidate_mmio_ptr). >>>>>> >>>>>> It looks like the problem is generally in the lack of the nesting rule >>>>>> for cpu-exclusive sections against BQL, so I thought I would try to >>>>>> address that. This patchset is my feeble attempt at this; I'm not sure >>>>>> I fully comprehend all the consequences (rather, I'm sure I don't) hence >>>>>> RFC. >>>>> >>>>> Hmm I think this is an area touched by: >>>>> >>>>> Subject: [PATCH v7 00/73] per-CPU locks >>>>> Date: Mon, 4 Mar 2019 13:17:00 -0500 >>>>> Message-Id: <20190304181813.8075-1-cota@braap.org> >>>>> >>>>> which has stalled on it's path into the tree. Last time I checked it >>>>> explicitly handled the concept of work that needed the BQL and work that >>>>> didn't. >>>> >>>> I'm still trying to get my head around that patchset, but it looks like >>>> it changes nothing in regards to cpu-exclusive sections and safe work, >>>> so it doesn't make the problem go. >>>> >>>>> How do you trigger your deadlock? Just hot-pluging CPUs? >>>> >>>> Yes. The window is pretty narrow so I only saw it once although this >>>> test (where the vms are started and stopped and the cpus are plugged in >>>> and out) is in our test loop for quite a bit (probably 2+ years). >>>> >>>> Roman. >>> >>> ping? >> >> ping? > > ping? > Queued for 4.2. Paolo