[V3,00/11] vhost: multiple worker support

Message ID: 20211022051911.108383-1-michael.christie@oracle.com

Message

Mike Christie Oct. 22, 2021, 5:18 a.m. UTC
The following patches apply over linus's tree and this patchset

https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/

which allows us to check the vhost owner thread's RLIMITs:

It looks like that patchset has been OK'd by all the major parties
and just needs a small cleanup to apply to Jens's and Paul's trees, so I
wanted to post my threading patches on top of it for review.

The following patches allow us to support multiple vhost workers per
device. I ended up just doing Stefan's original idea where userspace has
the kernel create a worker and we pass back the pid. Compared to the
workqueue and userspace thread approaches, this has the benefit that there
is only one-ish code path in the kernel: setup is the only place where we
detect old tools, and the main IO paths and device/vq setup/teardown paths
all use common code.
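
To make the flow concrete, here is a rough userspace sketch. The ioctl
names, numbers, and struct layout below are illustrative placeholders made
up for this cover letter, not necessarily the exact uapi the patches add:

/*
 * Hypothetical sketch: the ioctl names/numbers and the struct layout are
 * illustrative only. The real uapi is defined in the patches themselves.
 */
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

struct vhost_vring_worker_sketch {
	unsigned int index;	/* virtqueue index */
	int pid;		/* worker pid handed back by the kernel */
};

#define VHOST_NEW_WORKER_SKETCH		_IOR(VHOST_VIRTIO, 0x40, int)
#define VHOST_ATTACH_VRING_WORKER_SKETCH \
	_IOW(VHOST_VIRTIO, 0x41, struct vhost_vring_worker_sketch)

static int setup_worker_per_vq(int vhost_fd, unsigned int num_vqs)
{
	unsigned int i;

	/* VHOST_SET_OWNER has already run, so the default worker exists. */
	for (i = 0; i < num_vqs; i++) {
		struct vhost_vring_worker_sketch w = { .index = i };

		/* Ask the kernel to create a worker; it returns the pid. */
		if (ioctl(vhost_fd, VHOST_NEW_WORKER_SKETCH, &w.pid) < 0)
			return -1;
		/* Bind this virtqueue to the new worker. */
		if (ioctl(vhost_fd, VHOST_ATTACH_VRING_WORKER_SKETCH, &w) < 0)
			return -1;
		printf("vq %u -> worker pid %d\n", i, w.pid);
	}
	return 0;
}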

I've also included a patch for qemu so you can get an idea of how it
works. If we are ok with the kernel code then I'll break that up into
a patchset and send to qemu-devel for review.

Results:
--------

fio jobs        1       2       4       8       12      16
----------------------------------------------------------
1 worker        84k    492k    510k    -       -       -
worker per vq   184k   380k    744k    1422k   2256k   2434k

Notes:
0. This used a simple fio command:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=$JOBS_ABOVE

and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with emulate_pr=0 and these patches:

https://lore.kernel.org/all/yq1tuhge4bg.fsf@ca-mkp.ca.oracle.com/t/

which are in mkp's scsi branches for the next kernel. They fix the perf
issues where IOPS dropped at 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if num jobs * iodepth was
greater than 1024 I decreased the iodepth, so the 12 job runs used an
iodepth of 85 and the 16 job runs used 64. The 16 job case, for example,
looked roughly like:
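
fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=64  --numjobs=16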

3. The perf issue above at 2 jobs is because when we only have 1 worker
we execute more cmds per vhost_work, due to all vqs funneling into one
worker. This results in fewer context switches and better batching without
having to tweak any settings (a simplified sketch of the worker loop
follows the tables below). I'm working on patches to add back batching
during LIO completion and to do polling on the submission side.

We will still want the threading patches, because if we batch at the fio
level and also use the vhost threading patches, we can see a big boost like
the one below. So hopefully doing it in the kernel will allow apps to just
work without having to be smart like fio.

fio using io_uring and batching with the iodepth_batch* settings:

fio jobs        1       2       4       8       12      16
-------------------------------------------------------------
1 worker        494k    520k    -       -       -       -
worker per vq   496k    878k    1542k   2436k   2304k   2590k
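
As background for the batching point in note 3, here is a simplified
paraphrase of roughly the current single-worker loop in
drivers/vhost/vhost.c (mm attach and other details trimmed). Because every
vq queues into the same work_list, one list drain naturally runs a whole
batch of work items back to back:

/* Simplified paraphrase of the existing vhost_worker() loop. */
static int vhost_worker(void *data)
{
	struct vhost_dev *dev = data;
	struct vhost_work *work, *work_next;
	struct llist_node *node;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);

		if (kthread_should_stop()) {
			__set_current_state(TASK_RUNNING);
			break;
		}

		/* Grab everything queued so far, from all vqs, in one go. */
		node = llist_del_all(&dev->work_list);
		if (!node)
			schedule();

		node = llist_reverse_order(node);
		/* Run the whole batch back to back. */
		llist_for_each_entry_safe(work, work_next, node, node) {
			clear_bit(VHOST_WORK_QUEUED, &work->flags);
			__set_current_state(TASK_RUNNING);
			work->fn(work);
			if (need_resched())
				schedule();
		}
	}
	return 0;
}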

V3:
- Fully convert the vhost code to use vq-based APIs instead of leaving it
half per dev and half per vq.
- Rebase against the kernel worker API.
- Drop delayed worker creation. We always create the default worker at
VHOST_SET_OWNER time. Userspace can create and bind workers after that.

v2:
- Changed the loop in which we take a refcount on the worker.
- Replaced pid == -1 with a define.
- Fixed tabbing/spacing coding style issues.
- Switched from a list to a hash to look up workers (a rough sketch
follows below).
- Dropped the patch that added an ioctl cmd to get a vq's worker's
pid; it looks like we might do a generic netlink interface instead.
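
To illustrate the hash-based worker lookup mentioned in the v2 notes,
here is a rough sketch using the kernel hashtable API. The struct and
function names are made up for this example and are not the series'
actual code:

#include <linux/hashtable.h>
#include <linux/refcount.h>
#include <linux/types.h>

/* Illustrative only -- not the series' actual structures or names. */
struct vhost_worker_sketch {
	struct hlist_node h_node;
	pid_t pid;
	refcount_t refcount;
};

#define VHOST_WORKER_HASH_BITS 5
static DEFINE_HASHTABLE(vhost_worker_hash, VHOST_WORKER_HASH_BITS);

static void vhost_worker_hash_add(struct vhost_worker_sketch *w)
{
	/* Key the worker by the pid userspace got back at creation time. */
	hash_add(vhost_worker_hash, &w->h_node, w->pid);
}

static struct vhost_worker_sketch *vhost_worker_find(pid_t pid)
{
	struct vhost_worker_sketch *w;

	/* Walk only the bucket for this pid instead of a whole list. */
	hash_for_each_possible(vhost_worker_hash, w, h_node, pid)
		if (w->pid == pid)
			return w;
	return NULL;
}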

Comments

Mike Christie Oct. 22, 2021, 6:02 a.m. UTC | #1
On 10/22/21 12:18 AM, Mike Christie wrote:
> Results:
> --------
> 
> fio jobs        1       2       4       8       12      16
> ----------------------------------------------------------
> 1 worker        84k    492k    510k    -       -       -

That should be 184k above.

> worker per vq   184k   380k    744k    1422k   2256k   2434k
Michael S. Tsirkin Oct. 22, 2021, 9:48 a.m. UTC | #2
On Fri, Oct 22, 2021 at 12:18:59AM -0500, Mike Christie wrote:
> The following patches apply over linus's tree and this patchset
> 
> https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/
> 
> which allows us to check the vhost owner thread's RLIMITs:


Unfortunately that patchset in turn triggers kbuild warnings.
I was hoping you would address them, I don't think
merging that patchset before kbuild issues are addressed
is possible.

It also doesn't have lots of acks, I'm a bit apprehensive
of merging core changes like this through the vhost tree.
Try to CC more widely/ping people?

Michael S. Tsirkin Oct. 22, 2021, 9:49 a.m. UTC | #3
On Fri, Oct 22, 2021 at 01:02:19AM -0500, michael.christie@oracle.com wrote:
> On 10/22/21 12:18 AM, Mike Christie wrote:
> > Results:
> > --------
> > 
> > fio jobs        1       2       4       8       12      16
> > ----------------------------------------------------------
> > 1 worker        84k    492k    510k    -       -       -
> 
> That should be 184k above.

Nice. I'd like to merge this but blocked because of a dependency
(since we can't allow userspace to create threads without any limit).

> > worker per vq   184k   380k    744k    1422k   2256k   2434k
Mike Christie Oct. 22, 2021, 3:54 p.m. UTC | #4
Ccing Christian for the kernel worker API merging stuff.

On 10/22/21 4:48 AM, Michael S. Tsirkin wrote:
> On Fri, Oct 22, 2021 at 12:18:59AM -0500, Mike Christie wrote:
>> The following patches apply over linus's tree and this patchset
>>
>> https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/
>>
>> which allows us to check the vhost owner thread's RLIMITs:
> 
> 
> Unfortunately that patchset in turn triggers kbuild warnings.

Yeah, that's the Jens/Paul issue I mentioned. I have to remove the
old create_io_thread code and resolve issues with their trees. Paul's
tree has a conflict with Jens and then my patch has an issue with Paul's
patches.

So Christian and I thought we would re-push the patchset through
Christian after that has settled in 5.16-rc1 and then shoot for 5.17
so it has time to bake in next.


> I was hoping you would address them, I don't think
> merging that patchset before kbuild issues are addressed
> is possible.
> 
> It also doesn't have lots of acks, I'm a bit apprehensive
> of merging core changes like this through the vhost tree.

Ok. Just to make sure we are on the same page. Christian was going to
push the kernel worker API changes.

> Try to CC more widely/ping people?
Michael S. Tsirkin Oct. 23, 2021, 8:12 p.m. UTC | #5
On Fri, Oct 22, 2021 at 10:54:24AM -0500, michael.christie@oracle.com wrote:
> Ccing Christian for the kernel worker API merging stuff.
> 
> On 10/22/21 4:48 AM, Michael S. Tsirkin wrote:
> > On Fri, Oct 22, 2021 at 12:18:59AM -0500, Mike Christie wrote:
> >> The following patches apply over linus's tree and this patchset
> >>
> >> https://lore.kernel.org/all/20211007214448.6282-1-michael.christie@oracle.com/
> >>
> >> which allows us to check the vhost owner thread's RLIMITs:
> > 
> > 
> > Unfortunately that patchset in turn triggers kbuild warnings.
> 
> Yeah, that's the Jens/Paul issue I mentioned. I have to remove the
> old create_io_thread code and resolve issues with their trees. Paul's
> tree has a conflict with Jens and then my patch has an issue with Paul's
> patches.
> 
> So Christian and I thought we would re-push the patchset through
> Christian after that has settled in 5.16-rc1 and then shoot for 5.17
> so it has time to bake in next.
> 

Sounds good to me.

> > I was hoping you would address them, I don't think
> > merging that patchset before kbuild issues are addressed
> > is possible.
> > 
> > It also doesn't have lots of acks, I'm a bit apprehensive
> > of merging core changes like this through the vhost tree.
> 
> Ok. Just to make sure we are on the same page. Christian was going to
> push the kernel worker API changes.

Fine.

> > Try to CC more widely/ping people?