[RFC,0/8] vhost: allow userspace to control vq cpu affinity

Message ID 1607068593-16932-1-git-send-email-michael.christie@oracle.com (mailing list archive)
Series vhost: allow userspace to control vq cpu affinity

Message

Mike Christie Dec. 4, 2020, 7:56 a.m. UTC
The following patches, made over mst's vhost branch, allow userspace
to set each vq's CPU affinity. Currently, with cgroups, the worker thread
inherits the affinity settings, but we are at the mercy of the CPU
scheduler for where the vq's IO is executed. This can result in the
scheduler sometimes hammering a couple of queues on the host instead of
spreading the work out the way the guest's app might have intended if it
was mq aware.

This version of the patches does not use the interface you were
initially discussing, which was similar to nbd's old (3.x kernel days)
NBD_DO_IT ioctl, where userspace calls down to the kernel and we run
from that context. These patches instead just allow userspace to tell
the kernel which CPU a vq should run on. We then use the kernel's
workqueue code to handle the thread management.
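
From userspace the call could look something like this
(VHOST_SET_VRING_CPU is just a placeholder name for this mail; struct
vhost_vring_state is the existing one from linux/vhost.h):

	/* Placeholder example: bind vq 0 of this vhost device (vhost_fd is
	 * the open /dev/vhost-* fd) to CPU 2.
	 */
	struct vhost_vring_state state = {
		.index = 0,	/* vq index */
		.num = 2,	/* CPU the vq's work should run on */
	};

	if (ioctl(vhost_fd, VHOST_SET_VRING_CPU, &state) < 0)
		perror("set vq cpu");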

I wanted to post this version first because it is flexible: userspace
can set things up so devs/vqs share threads/CPUs, and we don't have to
replicate a bunch of features the workqueue code already has, like
dynamic thread creation, blocked work detection, idle thread detection
and thread reaping. The workqueue code also has an interface to control
how many threads can be created and which CPUs work can run on, if we
want to further restrict that from userspace.
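
On the kernel side the idea is roughly the sketch below (not the exact
code in the patches; the wq_work and cpu fields are ones this series
would add to the vq):

	/* Sketch: kicks are turned into work items queued on the CPU that
	 * userspace picked, and the workqueue code handles the threads.
	 */
	static struct workqueue_struct *vhost_wq;	/* alloc_workqueue() at init */

	static void vhost_vq_wq_fn(struct work_struct *work)
	{
		struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
							  wq_work);

		vq->handle_kick(&vq->poll.work);	/* run the existing kick handler */
	}

	static void vhost_vq_queue_work(struct vhost_virtqueue *vq)
	{
		/* vq->cpu was set by userspace via the new ioctl */
		queue_work_on(vq->cpu, vhost_wq, &vq->wq_work);
	}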

Note that these patches have been lightly tested. I mainly wanted
comments on the overall approach, because I know it's not really what
you were thinking about. But while I worked on being able to share
threads across multiple devices, I kept coming back to the existing
workqueue code and thinking I would just end up copying and pasting it.

Comments

Stefano Garzarella Dec. 4, 2020, 4:06 p.m. UTC | #1
Hi Mike,

On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>These patches were made over mst's vhost branch.
>
>The following patches, made over mst's vhost branch, allow userspace
>to set each vq's cpu affinity. Currently, with cgroups the worker thread
>inherits the affinity settings, but we are at the mercy of the CPU
>scheduler for where the vq's IO will be executed on. This can result in
>the scheduler sometimes hammering a couple queues on the host instead of
>spreading it out like how the guest's app might have intended if it was
>mq aware.
>
>This version of the patches is not what you guys were talking about
>initially like with the interface that was similar to nbd's old
>(3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>kernel and we run from that context. These patches instead just
>allow userspace to tell the kernel which CPU a vq should run on.
>We then use the kernel's workqueue code to handle the thread
>management.

I agree that reusing the kernel's workqueue code would be a good strategy.

One concern is how easy it is to implement an adaptive polling strategy 
using workqueues. From what I've seen, adding some polling of both 
backend and virtqueue helps to eliminate interrupts and reduce latency.

Anyway, I'll take a closer look at your patches next week. :-)

Thanks,
Stefano

>
>I wanted to post this version first, because it is flexible
>in that userspace can set things up so devs/vqs share threads/CPUs
>and we don't have to worry about replicating a bunch of features
>that the workqueue code already has like dynamic thread creation,
>blocked work detection, idle thread detection and thread reaping,
>and it also has an interface to control how many threads can be
>created and which CPUs work can run on if we want to further restrict
>that from userspace.
>
>Note that these patches have been lightly tested. I more wanted
>to get comments on the overall approach, because I know it's
>not really what you were thinking about. But while I worked
>on being able to share threads with multiple devices, I kept
>coming back to the existing workqueue code and thinking I'll
>just copy and paste that.
>
>
Mike Christie Dec. 4, 2020, 5:10 p.m. UTC | #2
On 12/4/20 10:06 AM, Stefano Garzarella wrote:
> Hi Mike,
> 
> On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>> These patches were made over mst's vhost branch.
>>
>> The following patches, made over mst's vhost branch, allow userspace
>> to set each vq's cpu affinity. Currently, with cgroups the worker thread
>> inherits the affinity settings, but we are at the mercy of the CPU
>> scheduler for where the vq's IO will be executed on. This can result in
>> the scheduler sometimes hammering a couple queues on the host instead of
>> spreading it out like how the guest's app might have intended if it was
>> mq aware.
>>
>> This version of the patches is not what you guys were talking about
>> initially like with the interface that was similar to nbd's old
>> (3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>> kernel and we run from that context. These patches instead just
>> allow userspace to tell the kernel which CPU a vq should run on.
>> We then use the kernel's workqueue code to handle the thread
>> management.
> 
> I agree that reusing kernel's workqueue code would be a good strategy.
> 
> One concern is how easy it is to implement an adaptive polling strategy 
> using workqueues. From what I've seen, adding some polling of both 
> backend and virtqueue helps to eliminate interrupts and reduce latency.
> 
Would the polling you need to do be similar to the vhost net poll code
in vhost_net_busy_poll (with a different algorithm, though)? And we want
to be able to poll multiple devs/vqs from the same CPU, right? Something
like:

retry:

for each poller on CPU N
	if poller has work
		driver->run work fn

if (poll limit hit)
	return
else
	cpu_relax();
goto retry;

?

If so, I had an idea for it. Let me send an additional patch on top of 
this set.
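
Roughly, the loop above in C would look something like the sketch below
(just an illustration; the vhost_poller struct and its callbacks are
made-up names, nothing from the actual patches):

	/* Sketch: each CPU has a list of pollers for the vqs/devs bound to
	 * it. Keep running any pending work, relaxing when nothing is
	 * found, until the poll time limit is hit.
	 */
	static void vhost_cpu_poll(struct list_head *pollers, unsigned long poll_jiffies)
	{
		unsigned long end = jiffies + poll_jiffies;
		struct vhost_poller *p;

		do {
			bool found = false;

			list_for_each_entry(p, pollers, node) {
				if (p->has_work(p)) {
					p->run_work(p);
					found = true;
				}
			}

			if (!found)
				cpu_relax();
		} while (time_before(jiffies, end));
	}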
Mike Christie Dec. 4, 2020, 5:33 p.m. UTC | #3
On 12/4/20 11:10 AM, Mike Christie wrote:
> On 12/4/20 10:06 AM, Stefano Garzarella wrote:
>> Hi Mike,
>>
>> On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>>> These patches were made over mst's vhost branch.
>>>
>>> The following patches, made over mst's vhost branch, allow userspace
>>> to set each vq's cpu affinity. Currently, with cgroups the worker thread
>>> inherits the affinity settings, but we are at the mercy of the CPU
>>> scheduler for where the vq's IO will be executed on. This can result in
>>> the scheduler sometimes hammering a couple queues on the host instead of
>>> spreading it out like how the guest's app might have intended if it was
>>> mq aware.
>>>
>>> This version of the patches is not what you guys were talking about
>>> initially like with the interface that was similar to nbd's old
>>> (3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>>> kernel and we run from that context. These patches instead just
>>> allow userspace to tell the kernel which CPU a vq should run on.
>>> We then use the kernel's workqueue code to handle the thread
>>> management.
>>
>> I agree that reusing kernel's workqueue code would be a good strategy.
>>
>> One concern is how easy it is to implement an adaptive polling 
>> strategy using workqueues. From what I've seen, adding some polling of 
>> both backend and virtqueue helps to eliminate interrupts and reduce 
>> latency.
>>
> Would the polling you need to do be similar to the vhost net poll code 
> like in vhost_net_busy_poll (different algorithm though)? But, we want 
> to be able to poll multiple devs/vqs from the same CPU right? Something 
> like:
> 
> retry:
> 
> for each poller on CPU N
>      if poller has work
>          driver->run work fn
> 
> if (poll limit hit)
>      return
> else
>      cpu_relax();
> goto retry:
> 
> ?
> 
> If so, I had an idea for it. Let me send an additional patch on top of 
> this set.

Oh yeah, just to make sure I am on the same page for vdpa, because scsi
and net work so differently.

Were you thinking that you would initially run from

vhost_poll_wakeup -> work->fn

then in the vdpa work->fn you would still do the kick_vq, but also kick
off a group backend/vq poller? This poller would then poll the vqs/devs
that were bound to that CPU from the worker/wq thread.

So I was thinking you want something similar to networking's NAPI: here
our work->fn is the hard irq, and the worker is like their softirq that
we poll from.
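
For the work->fn side, something like this is what I am picturing (again
just a sketch with made-up names, not what the vdpa code has today):

	/* Sketch: the work->fn stays cheap like a hard irq handler. It
	 * kicks the vq as it does now, then schedules the per-CPU group
	 * poller, which plays the softirq/NAPI role and polls the vqs/devs
	 * bound to that CPU from the worker/wq thread.
	 */
	static void vdpa_vq_work_fn(struct vhost_work *work)
	{
		struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
							  poll.work);

		vdpa_kick_vq(vq);			/* what the work fn does today */
		vhost_cpu_poller_schedule(vq->cpu);	/* new: start polling this CPU's vqs */
	}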
Stefano Garzarella Dec. 9, 2020, 3:58 p.m. UTC | #4
Hi Mike,
sorry for the delay but there were holidays.

On Fri, Dec 04, 2020 at 11:33:11AM -0600, Mike Christie wrote:
>On 12/4/20 11:10 AM, Mike Christie wrote:
>>On 12/4/20 10:06 AM, Stefano Garzarella wrote:
>>>Hi Mike,
>>>
>>>On Fri, Dec 04, 2020 at 01:56:25AM -0600, Mike Christie wrote:
>>>>These patches were made over mst's vhost branch.
>>>>
>>>>The following patches, made over mst's vhost branch, allow userspace
>>>>to set each vq's cpu affinity. Currently, with cgroups the worker thread
>>>>inherits the affinity settings, but we are at the mercy of the CPU
>>>>scheduler for where the vq's IO will be executed on. This can result in
>>>>the scheduler sometimes hammering a couple queues on the host instead of
>>>>spreading it out like how the guest's app might have intended if it was
>>>>mq aware.
>>>>
>>>>This version of the patches is not what you guys were talking about
>>>>initially like with the interface that was similar to nbd's old
>>>>(3.x kernel days) NBD_DO_IT ioctl where userspace calls down to the
>>>>kernel and we run from that context. These patches instead just
>>>>allow userspace to tell the kernel which CPU a vq should run on.
>>>>We then use the kernel's workqueue code to handle the thread
>>>>management.
>>>
>>>I agree that reusing kernel's workqueue code would be a good strategy.
>>>
>>>One concern is how easy it is to implement an adaptive polling 
>>>strategy using workqueues. From what I've seen, adding some 
>>>polling of both backend and virtqueue helps to eliminate 
>>>interrupts and reduce latency.
>>>
>>Would the polling you need to do be similar to the vhost net poll 
>>code like in vhost_net_busy_poll (different algorithm though)? But, 
>>we want to be able to poll multiple devs/vqs from the same CPU 
>>right? Something like:
>>
>>retry:
>>
>>for each poller on CPU N
>>     if poller has work
>>         driver->run work fn
>>
>>if (poll limit hit)
>>     return
>>else
>>     cpu_relax();
>>goto retry:
>>
>>?

Yeah, something similar. IIUC vhost_net_busy_poll() polls both the vring
and the backend (socket).

Maybe we need to limit the amount of work done in work->fn to avoid
starvation.

>>
>>If so, I had an idea for it. Let me send an additional patch on top 
>>of this set.

Sure :-)

>
>Oh yeah, just to make sure I am on the same page for vdpa, because 
>scsi and net work so differnetly.
>
>Were you thinking that you would initially run from
>
>vhost_poll_wakeup -> work->fn
>
>then in the vdpa work->fn you would do the kick_vq still, but then 
>also kick off a group backend/vq poller. This would then poll the 
>vqs/devs that were bound to that CPU from the worker/wq thread.

Yes, this seems reasonable!

>
>So I was thinking you want something similar to network's NAPI. Here 

I don't know NAPI very well, but IIUC the goal is the same: try to avoid
notifications (IRQs from the device, vm-exits from the guest) by doing
adaptive polling.

>our work->fn is the hard irq, and then the worker is like their softirq 
>we poll from.
>

I'm a little lost here...

Thanks,
Stefano