[07/17] vhost scsi: support delayed IO vq creation

Message ID	1603326903-27052-8-git-send-email-michael.christie@oracle.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=TeMy=D5=vger.kernel.org=target-devel-owner@kernel.org> From: Mike Christie <michael.christie@oracle.com> To: martin.petersen@oracle.com, linux-scsi@vger.kernel.org, target-devel@vger.kernel.org, mst@redhat.com, jasowang@redhat.com, pbonzini@redhat.com, stefanha@redhat.com, virtualization@lists.linux-foundation.org Subject: [PATCH 07/17] vhost scsi: support delayed IO vq creation Date: Wed, 21 Oct 2020 19:34:53 -0500 Message-Id: <1603326903-27052-8-git-send-email-michael.christie@oracle.com> In-Reply-To: <1603326903-27052-1-git-send-email-michael.christie@oracle.com> References: <1603326903-27052-1-git-send-email-michael.christie@oracle.com> Precedence: bulk
Series	vhost: fix scsi cmd handling and cgroup support \| expand [00/17,V3] vhost: fix scsi cmd handling and cgroup support [01/17] vhost scsi: add lun parser helper [02/17] vhost: remove work arg from vhost_work_flush [03/17] vhost net: use goto error handling in open [04/17] vhost: prep vhost_dev_init users to handle failures [05/17] vhost: move vq iovec allocation to dev init time [06/17] vhost: support delayed vq creation [07/17] vhost scsi: support delayed IO vq creation [08/17] vhost scsi: alloc cmds per vq instead of session [09/17] vhost scsi: fix cmd completion race [10/17] vhost scsi: Add support for LUN resets. [11/17] vhost scsi: remove extra flushes [12/17] vhost poll: fix coding style [13/17] vhost: support multiple worker threads [14/17] vhost: poll support support multiple workers [15/17] host: support delayed vq creation [16/17] vhost scsi: multiple worker support [17/17] vhost scsi: drop submission workqueue

Mike Christie Oct. 22, 2020, 12:34 a.m. UTC

Each vhost-scsi device will need a evt and ctl queue, but the number
of IO queues depends on whatever the user has configured in userspace.
This patch has vhost-scsi create the evt, ctl and one IO vq at device
open time. We then create the other IO vqs when userspace starts to
set them up. We still waste some mem on the vq and scsi vq structs,
but we don't waste mem on iovec related arrays and for later patches
we know which queues are used by the dev->nvqs value.

Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/vhost/scsi.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

Jason Wang Oct. 26, 2020, 3:51 a.m. UTC | #1

On 2020/10/22 上午8:34, Mike Christie wrote:
> Each vhost-scsi device will need a evt and ctl queue, but the number
> of IO queues depends on whatever the user has configured in userspace.
> This patch has vhost-scsi create the evt, ctl and one IO vq at device
> open time. We then create the other IO vqs when userspace starts to
> set them up. We still waste some mem on the vq and scsi vq structs,
> but we don't waste mem on iovec related arrays and for later patches
> we know which queues are used by the dev->nvqs value.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>   drivers/vhost/scsi.c | 19 +++++++++++++++----
>   1 file changed, 15 insertions(+), 4 deletions(-)


Not familiar with SCSI. But I wonder if it could behave like vhost-net.

E.g userspace should known the number of virtqueues so it can just open 
and close multiple vhost-scsi file descriptors.

Thanks

Mike Christie Oct. 27, 2020, 5:47 a.m. UTC | #2

On 10/25/20 10:51 PM, Jason Wang wrote:
> 
> On 2020/10/22 上午8:34, Mike Christie wrote:
>> Each vhost-scsi device will need a evt and ctl queue, but the number
>> of IO queues depends on whatever the user has configured in userspace.
>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>> open time. We then create the other IO vqs when userspace starts to
>> set them up. We still waste some mem on the vq and scsi vq structs,
>> but we don't waste mem on iovec related arrays and for later patches
>> we know which queues are used by the dev->nvqs value.
>>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
>>   drivers/vhost/scsi.c | 19 +++++++++++++++----
>>   1 file changed, 15 insertions(+), 4 deletions(-)
> 
> 
> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
> 
> E.g userspace should known the number of virtqueues so it can just open 
> and close multiple vhost-scsi file descriptors.
> 

One hiccup I'm hitting is that we might end up creating about 3x more 
vqs than we need. The problem is that for scsi each vhost device has:

vq=0: special control vq
vq=1: event vq
vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.

Today we do:

Uerspace does open(/dev/vhost-scsi)
         vhost_dev_init(create 128 vqs and then later we setup and use N 
of them);

Qemu does ioctl(VHOST_SET_OWNER)
         vhost_dev_set_owner()

For N vqs userspace does:
         // virtqueue setup related ioctls

Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
         - match LIO/target port to vhost_dev

So we could change that to:

For N IO vqs userspace does
         open(/dev/vhost-scsi)
                 vhost_dev_init(create IO, evt, and ctl);

for N IO vqs Qemu does:
         ioctl(VHOST_SET_OWNER)
                 vhost_dev_set_owner()

for N IO vqs Qemu does:
         // virtqueue setup related ioctls

for N IO vqs Qemu does:
         ioctl(VHOST_SCSI_SET_ENDPOINT)
                 - match LIO/target port to vhost_dev and assemble the 
multiple vhost_dev device.

The problem is that we have to setup some of the evt/ctl specific parts 
at open() time when vhost_dev_init does vhost_poll_init for example.

- At open time, we don't know if this vhost_dev is going to be part of a 
multiple vhost_device device or a single one so we need to create at 
least 3 of them
- If it is a multiple device we don't know if its the first device being 
created for the device or the N'th, so we don't know if the dev's vqs 
will be used for IO or ctls/evts, so we have to create all 3.

When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style 
multiple vhost_dev device, we can use that dev's evt/ctl vqs for 
events/controls requests. When we get the other VHOST_SCSI_SET_ENDPOINT 
calls for the multiple vhost_dev device then those dev's evt/ctl vqs 
will be ignored and we will only use their IO vqs. So we end up with a 
lot of extra vqs.

One other question/issue I have is that qemu can open the 
/dev/vhost-scsi device or it allows tools like libvirtd to open the 
device and pass in the fd to use. For the latter case, would we continue 
to have those tools pass in the leading fd, then have qemu do the other 
num_queues - 1 open(/dev/vhost-scsi) calls? Or do these apps that pass 
in the fd need to know about all of the fds for some management reason?

Jason Wang Oct. 28, 2020, 1:55 a.m. UTC | #3

On 2020/10/27 下午1:47, Mike Christie wrote:
> On 10/25/20 10:51 PM, Jason Wang wrote:
>>
>> On 2020/10/22 上午8:34, Mike Christie wrote:
>>> Each vhost-scsi device will need a evt and ctl queue, but the number
>>> of IO queues depends on whatever the user has configured in userspace.
>>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>>> open time. We then create the other IO vqs when userspace starts to
>>> set them up. We still waste some mem on the vq and scsi vq structs,
>>> but we don't waste mem on iovec related arrays and for later patches
>>> we know which queues are used by the dev->nvqs value.
>>>
>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>> ---
>>>   drivers/vhost/scsi.c | 19 +++++++++++++++----
>>>   1 file changed, 15 insertions(+), 4 deletions(-)
>>
>>
>> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
>>
>> E.g userspace should known the number of virtqueues so it can just 
>> open and close multiple vhost-scsi file descriptors.
>>
>
> One hiccup I'm hitting is that we might end up creating about 3x more 
> vqs than we need. The problem is that for scsi each vhost device has:
>
> vq=0: special control vq
> vq=1: event vq
> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
>
> Today we do:
>
> Uerspace does open(/dev/vhost-scsi)
>         vhost_dev_init(create 128 vqs and then later we setup and use 
> N of them);
>
> Qemu does ioctl(VHOST_SET_OWNER)
>         vhost_dev_set_owner()
>
> For N vqs userspace does:
>         // virtqueue setup related ioctls
>
> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>         - match LIO/target port to vhost_dev
>
>
> So we could change that to:
>
> For N IO vqs userspace does
>         open(/dev/vhost-scsi)
>                 vhost_dev_init(create IO, evt, and ctl);
>
> for N IO vqs Qemu does:
>         ioctl(VHOST_SET_OWNER)
>                 vhost_dev_set_owner()
>
> for N IO vqs Qemu does:
>         // virtqueue setup related ioctls
>
> for N IO vqs Qemu does:
>         ioctl(VHOST_SCSI_SET_ENDPOINT)
>                 - match LIO/target port to vhost_dev and assemble the 
> multiple vhost_dev device.
>
> The problem is that we have to setup some of the evt/ctl specific 
> parts at open() time when vhost_dev_init does vhost_poll_init for 
> example.
>
> - At open time, we don't know if this vhost_dev is going to be part of 
> a multiple vhost_device device or a single one so we need to create at 
> least 3 of them
> - If it is a multiple device we don't know if its the first device 
> being created for the device or the N'th, so we don't know if the 
> dev's vqs will be used for IO or ctls/evts, so we have to create all 3.
>
> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style 
> multiple vhost_dev device, we can use that dev's evt/ctl vqs for 
> events/controls requests. When we get the other 
> VHOST_SCSI_SET_ENDPOINT calls for the multiple vhost_dev device then 
> those dev's evt/ctl vqs will be ignored and we will only use their IO 
> vqs. So we end up with a lot of extra vqs.


Right, so in this case we can use this patch to address this issue 
probably. If evt/ctl vq is not used, we won't even create them.


>
>
> One other question/issue I have is that qemu can open the 
> /dev/vhost-scsi device or it allows tools like libvirtd to open the 
> device and pass in the fd to use.


It allows libvirt to open and pass fds to qemu. This is how multie-queue 
virtio-net is done, libvirt is in charge of opening multiple file 
descriptors and pass them to qemu.


> For the latter case, would we continue to have those tools pass in the 
> leading fd, then have qemu do the other num_queues - 1 
> open(/dev/vhost-scsi) calls? Or do these apps that pass in the fd need 
> to know about all of the fds for some management reason?


Usually qemu is running without privilege. So it depends on the 
management to open the device.

Note that I'm not object your proposal, just want to see if it could be 
done via a more easy way. During the development if multiqueue 
virito-net, something similar as you've done was proposed but we end up 
with the multiple vhost-net fd model which keeps kernel code unchanged.

Thanks

Michael S. Tsirkin Oct. 30, 2020, 8:47 a.m. UTC | #4

On Tue, Oct 27, 2020 at 12:47:34AM -0500, Mike Christie wrote:
> On 10/25/20 10:51 PM, Jason Wang wrote:
> > 
> > On 2020/10/22 上午8:34, Mike Christie wrote:
> > > Each vhost-scsi device will need a evt and ctl queue, but the number
> > > of IO queues depends on whatever the user has configured in userspace.
> > > This patch has vhost-scsi create the evt, ctl and one IO vq at device
> > > open time. We then create the other IO vqs when userspace starts to
> > > set them up. We still waste some mem on the vq and scsi vq structs,
> > > but we don't waste mem on iovec related arrays and for later patches
> > > we know which queues are used by the dev->nvqs value.
> > > 
> > > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> > > ---
> > >   drivers/vhost/scsi.c | 19 +++++++++++++++----
> > >   1 file changed, 15 insertions(+), 4 deletions(-)
> > 
> > 
> > Not familiar with SCSI. But I wonder if it could behave like vhost-net.
> > 
> > E.g userspace should known the number of virtqueues so it can just open
> > and close multiple vhost-scsi file descriptors.
> > 
> 
> One hiccup I'm hitting is that we might end up creating about 3x more vqs
> than we need. The problem is that for scsi each vhost device has:
> 
> vq=0: special control vq
> vq=1: event vq
> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
> 
> Today we do:
> 
> Uerspace does open(/dev/vhost-scsi)
>         vhost_dev_init(create 128 vqs and then later we setup and use N of
> them);
> 
> Qemu does ioctl(VHOST_SET_OWNER)
>         vhost_dev_set_owner()
> 
> For N vqs userspace does:
>         // virtqueue setup related ioctls
> 
> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>         - match LIO/target port to vhost_dev
> 
> 
> So we could change that to:
> 
> For N IO vqs userspace does
>         open(/dev/vhost-scsi)
>                 vhost_dev_init(create IO, evt, and ctl);
> 
> for N IO vqs Qemu does:
>         ioctl(VHOST_SET_OWNER)
>                 vhost_dev_set_owner()
> 
> for N IO vqs Qemu does:
>         // virtqueue setup related ioctls
> 
> for N IO vqs Qemu does:
>         ioctl(VHOST_SCSI_SET_ENDPOINT)
>                 - match LIO/target port to vhost_dev and assemble the
> multiple vhost_dev device.
> 
> The problem is that we have to setup some of the evt/ctl specific parts at
> open() time when vhost_dev_init does vhost_poll_init for example.
> 
> - At open time, we don't know if this vhost_dev is going to be part of a
> multiple vhost_device device or a single one so we need to create at least 3
> of them
> - If it is a multiple device we don't know if its the first device being
> created for the device or the N'th, so we don't know if the dev's vqs will
> be used for IO or ctls/evts, so we have to create all 3.
> 
> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style multiple
> vhost_dev device, we can use that dev's evt/ctl vqs for events/controls
> requests. When we get the other VHOST_SCSI_SET_ENDPOINT calls for the
> multiple vhost_dev device then those dev's evt/ctl vqs will be ignored and
> we will only use their IO vqs. So we end up with a lot of extra vqs.

The issue Jason's hinting at is how can admins control the amount
of resources a given qemu instance can consume?
After all vhost vqs all live in host kernel memory ...
Limiting # of open fds would be one way to do that ...

The need to share event/control vqs between devices is a problem though,
and sending lots of ioctls on things like reset is also not that elegant.
Jason, did you have a good solution in mind?

> One other question/issue I have is that qemu can open the /dev/vhost-scsi
> device or it allows tools like libvirtd to open the device and pass in the
> fd to use. For the latter case, would we continue to have those tools pass
> in the leading fd, then have qemu do the other num_queues - 1
> open(/dev/vhost-scsi) calls? Or do these apps that pass in the fd need to
> know about all of the fds for some management reason?

They know about all the fds, for resource control and priveledge
separation reasons.

Mike Christie Oct. 30, 2020, 4:30 p.m. UTC | #5

On 10/30/20 3:47 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 27, 2020 at 12:47:34AM -0500, Mike Christie wrote:
>> On 10/25/20 10:51 PM, Jason Wang wrote:
>>>
>>> On 2020/10/22 上午8:34, Mike Christie wrote:
>>>> Each vhost-scsi device will need a evt and ctl queue, but the number
>>>> of IO queues depends on whatever the user has configured in userspace.
>>>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>>>> open time. We then create the other IO vqs when userspace starts to
>>>> set them up. We still waste some mem on the vq and scsi vq structs,
>>>> but we don't waste mem on iovec related arrays and for later patches
>>>> we know which queues are used by the dev->nvqs value.
>>>>
>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>> ---
>>>>    drivers/vhost/scsi.c | 19 +++++++++++++++----
>>>>    1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>>
>>> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
>>>
>>> E.g userspace should known the number of virtqueues so it can just open
>>> and close multiple vhost-scsi file descriptors.
>>>
>>
>> One hiccup I'm hitting is that we might end up creating about 3x more vqs
>> than we need. The problem is that for scsi each vhost device has:
>>
>> vq=0: special control vq
>> vq=1: event vq
>> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
>>
>> Today we do:
>>
>> Uerspace does open(/dev/vhost-scsi)
>>          vhost_dev_init(create 128 vqs and then later we setup and use N of
>> them);
>>
>> Qemu does ioctl(VHOST_SET_OWNER)
>>          vhost_dev_set_owner()
>>
>> For N vqs userspace does:
>>          // virtqueue setup related ioctls
>>
>> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>>          - match LIO/target port to vhost_dev
>>
>>
>> So we could change that to:
>>
>> For N IO vqs userspace does
>>          open(/dev/vhost-scsi)
>>                  vhost_dev_init(create IO, evt, and ctl);
>>
>> for N IO vqs Qemu does:
>>          ioctl(VHOST_SET_OWNER)
>>                  vhost_dev_set_owner()
>>
>> for N IO vqs Qemu does:
>>          // virtqueue setup related ioctls
>>
>> for N IO vqs Qemu does:
>>          ioctl(VHOST_SCSI_SET_ENDPOINT)
>>                  - match LIO/target port to vhost_dev and assemble the
>> multiple vhost_dev device.
>>
>> The problem is that we have to setup some of the evt/ctl specific parts at
>> open() time when vhost_dev_init does vhost_poll_init for example.
>>
>> - At open time, we don't know if this vhost_dev is going to be part of a
>> multiple vhost_device device or a single one so we need to create at least 3
>> of them
>> - If it is a multiple device we don't know if its the first device being
>> created for the device or the N'th, so we don't know if the dev's vqs will
>> be used for IO or ctls/evts, so we have to create all 3.
>>
>> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style multiple
>> vhost_dev device, we can use that dev's evt/ctl vqs for events/controls
>> requests. When we get the other VHOST_SCSI_SET_ENDPOINT calls for the
>> multiple vhost_dev device then those dev's evt/ctl vqs will be ignored and
>> we will only use their IO vqs. So we end up with a lot of extra vqs.
> 
> The issue Jason's hinting at is how can admins control the amount
> of resources a given qemu instance can consume?
> After all vhost vqs all live in host kernel memory ...
> Limiting # of open fds would be one way to do that ...

If I understand you, then the answer is vhost scsi has a setting 
num_queues already that controls the number of vqs. The upstream 
kernel's vhost scsi driver and qemu's vhost scsi code support multiqueue 
today. To enable it, the admin is setting the qemu property num_queues 
(qemu/hw/scsi/host-scsi.c). In the current code, we are already doing 
what I described in "Today we do:".

In the second chunk of patches (patches 13 - 16) I'm just trying to make 
it so vhost-scsi gets a thread per IO vq.

Patch 17 then fixes up the cgroup support so the user can control the IO 
vqs with cgroups. Today for vhost scsi the vhost work thread takes the 
request from the vq, then passes it to a workqueue_struct workqueue to 
submit it to the block layer. So today we are putting the vhost work 
thread in the cgroup, but it's a different thread interacting with the 
block layer, and the cgroup settings/limits are not applying.


> 
> The need to share event/control vqs between devices is a problem though,
> and sending lots of ioctls on things like reset is also not that elegant.
> Jason, did you have a good solution in mind?
> 
>> One other question/issue I have is that qemu can open the /dev/vhost-scsi
>> device or it allows tools like libvirtd to open the device and pass in the
>> fd to use. For the latter case, would we continue to have those tools pass
>> in the leading fd, then have qemu do the other num_queues - 1
>> open(/dev/vhost-scsi) calls? Or do these apps that pass in the fd need to
>> know about all of the fds for some management reason?
> 
> They know about all the fds, for resource control and priveledge
> separation reasons.
>

Mike Christie Oct. 30, 2020, 5:26 p.m. UTC | #6

On 10/30/20 11:30 AM, Mike Christie wrote:
> On 10/30/20 3:47 AM, Michael S. Tsirkin wrote:
>> On Tue, Oct 27, 2020 at 12:47:34AM -0500, Mike Christie wrote:
>>> On 10/25/20 10:51 PM, Jason Wang wrote:
>>>>
>>>> On 2020/10/22 上午8:34, Mike Christie wrote:
>>>>> Each vhost-scsi device will need a evt and ctl queue, but the number
>>>>> of IO queues depends on whatever the user has configured in userspace.
>>>>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>>>>> open time. We then create the other IO vqs when userspace starts to
>>>>> set them up. We still waste some mem on the vq and scsi vq structs,
>>>>> but we don't waste mem on iovec related arrays and for later patches
>>>>> we know which queues are used by the dev->nvqs value.
>>>>>
>>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>>> ---
>>>>>    drivers/vhost/scsi.c | 19 +++++++++++++++----
>>>>>    1 file changed, 15 insertions(+), 4 deletions(-)
>>>>
>>>>
>>>> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
>>>>
>>>> E.g userspace should known the number of virtqueues so it can just open
>>>> and close multiple vhost-scsi file descriptors.
>>>>
>>>
>>> One hiccup I'm hitting is that we might end up creating about 3x more 
>>> vqs
>>> than we need. The problem is that for scsi each vhost device has:
>>>
>>> vq=0: special control vq
>>> vq=1: event vq
>>> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
>>>
>>> Today we do:
>>>
>>> Uerspace does open(/dev/vhost-scsi)
>>>          vhost_dev_init(create 128 vqs and then later we setup and 
>>> use N of
>>> them);
>>>
>>> Qemu does ioctl(VHOST_SET_OWNER)
>>>          vhost_dev_set_owner()
>>>
>>> For N vqs userspace does:
>>>          // virtqueue setup related ioctls
>>>
>>> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>>>          - match LIO/target port to vhost_dev
>>>
>>>
>>> So we could change that to:
>>>
>>> For N IO vqs userspace does
>>>          open(/dev/vhost-scsi)
>>>                  vhost_dev_init(create IO, evt, and ctl);
>>>
>>> for N IO vqs Qemu does:
>>>          ioctl(VHOST_SET_OWNER)
>>>                  vhost_dev_set_owner()
>>>
>>> for N IO vqs Qemu does:
>>>          // virtqueue setup related ioctls
>>>
>>> for N IO vqs Qemu does:
>>>          ioctl(VHOST_SCSI_SET_ENDPOINT)
>>>                  - match LIO/target port to vhost_dev and assemble the
>>> multiple vhost_dev device.
>>>
>>> The problem is that we have to setup some of the evt/ctl specific 
>>> parts at
>>> open() time when vhost_dev_init does vhost_poll_init for example.
>>>
>>> - At open time, we don't know if this vhost_dev is going to be part of a
>>> multiple vhost_device device or a single one so we need to create at 
>>> least 3
>>> of them
>>> - If it is a multiple device we don't know if its the first device being
>>> created for the device or the N'th, so we don't know if the dev's vqs 
>>> will
>>> be used for IO or ctls/evts, so we have to create all 3.
>>>
>>> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style 
>>> multiple
>>> vhost_dev device, we can use that dev's evt/ctl vqs for events/controls
>>> requests. When we get the other VHOST_SCSI_SET_ENDPOINT calls for the
>>> multiple vhost_dev device then those dev's evt/ctl vqs will be 
>>> ignored and
>>> we will only use their IO vqs. So we end up with a lot of extra vqs.
>>
>> The issue Jason's hinting at is how can admins control the amount
>> of resources a given qemu instance can consume?
>> After all vhost vqs all live in host kernel memory ...
>> Limiting # of open fds would be one way to do that ...
> 
> If I understand you, then the answer is vhost scsi has a setting 
> num_queues already that controls the number of vqs. The upstream 
> kernel's vhost scsi driver and qemu's vhost scsi code support multiqueue 
> today. To enable it, the admin is setting the qemu property num_queues 
> (qemu/hw/scsi/host-scsi.c). In the current code, we are already doing 
> what I described in "Today we do:".
> 
> In the second chunk of patches (patches 13 - 16) I'm just trying to make 
> it so vhost-scsi gets a thread per IO vq.
> 
> Patch 17 then fixes up the cgroup support so the user can control the IO 
> vqs with cgroups. Today for vhost scsi the vhost work thread takes the 
> request from the vq, then passes it to a workqueue_struct workqueue to 
> submit it to the block layer. So today we are putting the vhost work 
> thread in the cgroup, but it's a different thread interacting with the 
> block layer, and the cgroup settings/limits are not applying.
> 

Ah, I think I did misundestand you. Today, you can set the fd limit to N 
and that would limit the total number of devices. But right now the user 
can set each of those N device's to have anywhere from num_queues=1 - 
128 which could be a wide range of resource use. You want something 
finer grained right?

Mike Christie Nov. 1, 2020, 10:06 p.m. UTC | #7

On 10/30/20 3:47 AM, Michael S. Tsirkin wrote:
> On Tue, Oct 27, 2020 at 12:47:34AM -0500, Mike Christie wrote:
>> On 10/25/20 10:51 PM, Jason Wang wrote:
>>>
>>> On 2020/10/22 上午8:34, Mike Christie wrote:
>>>> Each vhost-scsi device will need a evt and ctl queue, but the number
>>>> of IO queues depends on whatever the user has configured in userspace.
>>>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>>>> open time. We then create the other IO vqs when userspace starts to
>>>> set them up. We still waste some mem on the vq and scsi vq structs,
>>>> but we don't waste mem on iovec related arrays and for later patches
>>>> we know which queues are used by the dev->nvqs value.
>>>>
>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>> ---
>>>>   drivers/vhost/scsi.c | 19 +++++++++++++++----
>>>>   1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>>
>>> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
>>>
>>> E.g userspace should known the number of virtqueues so it can just open
>>> and close multiple vhost-scsi file descriptors.
>>>
>>
>> One hiccup I'm hitting is that we might end up creating about 3x more vqs
>> than we need. The problem is that for scsi each vhost device has:
>>
>> vq=0: special control vq
>> vq=1: event vq
>> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
>>
>> Today we do:
>>
>> Uerspace does open(/dev/vhost-scsi)
>>         vhost_dev_init(create 128 vqs and then later we setup and use N of
>> them);
>>
>> Qemu does ioctl(VHOST_SET_OWNER)
>>         vhost_dev_set_owner()
>>
>> For N vqs userspace does:
>>         // virtqueue setup related ioctls
>>
>> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>>         - match LIO/target port to vhost_dev
>>
>>
>> So we could change that to:
>>
>> For N IO vqs userspace does
>>         open(/dev/vhost-scsi)
>>                 vhost_dev_init(create IO, evt, and ctl);
>>
>> for N IO vqs Qemu does:
>>         ioctl(VHOST_SET_OWNER)
>>                 vhost_dev_set_owner()
>>
>> for N IO vqs Qemu does:
>>         // virtqueue setup related ioctls
>>
>> for N IO vqs Qemu does:
>>         ioctl(VHOST_SCSI_SET_ENDPOINT)
>>                 - match LIO/target port to vhost_dev and assemble the
>> multiple vhost_dev device.
>>
>> The problem is that we have to setup some of the evt/ctl specific parts at
>> open() time when vhost_dev_init does vhost_poll_init for example.
>>
>> - At open time, we don't know if this vhost_dev is going to be part of a
>> multiple vhost_device device or a single one so we need to create at least 3
>> of them
>> - If it is a multiple device we don't know if its the first device being
>> created for the device or the N'th, so we don't know if the dev's vqs will
>> be used for IO or ctls/evts, so we have to create all 3.
>>
>> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style multiple
>> vhost_dev device, we can use that dev's evt/ctl vqs for events/controls
>> requests. When we get the other VHOST_SCSI_SET_ENDPOINT calls for the
>> multiple vhost_dev device then those dev's evt/ctl vqs will be ignored and
>> we will only use their IO vqs. So we end up with a lot of extra vqs.
> 
> The issue Jason's hinting at is how can admins control the amount
> of resources a given qemu instance can consume?
> After all vhost vqs all live in host kernel memory ...
> Limiting # of open fds would be one way to do that ...
> 
> The need to share event/control vqs between devices is a problem though,
> and sending lots of ioctls on things like reset is also not that elegant.
> Jason, did you have a good solution in mind?
> 

Hey, so here is a prototype/outline of how we could add support for the
multiple device approach and keep compat support for the existing single
device multiple vq code. And, for the new style multiple dev approach we
keep the vq allocation to a minimum.

This patch was made over patches 0 - 11 in this patchset, but do not waste
your time reviewing this patch line by line. It's still really broken :) It
should give you an idea of what I was saying above about the evt/ctl queue
issue and give you an idea of how ugly/nice it is vs the patches 12 - 16
in this set.

--------------

In this patch we add a new struct vhost_scsi_md that represents multiple
vhost_scsi devices that are being combined to make one device.

Userspace signals the kernel it supports the new md approach by writing to
a new mod param vhost_scsi_multi_dev_per_nexus. If that is set then at
open() scsi.c will do vhost_dev_init with 1 vq. The vq's handle_kick
function is a dummy no op (vhost_scsi_no_op_kick), because at this time we
don't know if this device's vq is going to be a evt, ctl or IO vq.

Userpsace would then do open() N times for each vq it wanted to create.

Qemu then does it's dev and ring/vq setup.

Lastly qemu does the ioctl that calls into vhost_scsi_set_endpoint. For this
function scsi has to figure out if it's a md device or old style one. If a md
device then we figure out which vq this will be in the new function
vhost_scsi_md_add_vs().

Here is where it gets a little gross. Because we don't know what type of
vq it is at vhost_dev_init/open time, I've added a new function
vhost_vq_reset_kick_handler which just resets the handle_kick callout
that we had setup in vhost_dev_init. We call this in vhost_scsi_md_add_vs
to set the correct handle_kick function.



diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 774bffe..f18f7b1 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -62,6 +62,12 @@
  */
 #define VHOST_SCSI_WEIGHT 256
 
+static bool vhost_scsi_multi_dev_per_nexus;
+module_param_named(multiple_vhost_devs_per_nexus,
+		   vhost_scsi_multi_dev_per_nexus, bool, 0644);
+MODULE_PARM_DESC(multiple_vhost_devs_per_nexus,
+		 "Turn on support for combing multiple vhost-scsi device instances into a single I_T Nexus. Set to true to turn on. Default is off.");
+
 struct vhost_scsi_inflight {
 	/* Wait for the flush operation to finish */
 	struct completion comp;
@@ -127,7 +133,7 @@ struct vhost_scsi_tpg {
 	int tv_tpg_vhost_count;
 	/* Used for enabling T10-PI with legacy devices */
 	int tv_fabric_prot_type;
-	/* list for vhost_scsi_list */
+	/* list for vhost_scsi_tpg_list */
 	struct list_head tv_tpg_list;
 	/* Used to protect access for tpg_nexus */
 	struct mutex tv_tpg_mutex;
@@ -137,7 +143,7 @@ struct vhost_scsi_tpg {
 	struct vhost_scsi_tport *tport;
 	/* Returned by vhost_scsi_make_tpg() */
 	struct se_portal_group se_tpg;
-	/* Pointer back to vhost_scsi, protected by tv_tpg_mutex */
+	/* Pointer back to vhost_scsi used for events, protected by tv_tpg_mutex */
 	struct vhost_scsi *vhost_scsi;
 	struct list_head tmf_queue;
 };
@@ -194,13 +200,22 @@ struct vhost_scsi_virtqueue {
 	int max_cmds;
 };
 
+struct vhost_scsi_md {
+	struct list_head vhost_scsi_md_list_entry;
+	struct list_head vhost_scsi_list;
+	int vs_cnt;
+};
+
 struct vhost_scsi {
 	/* Protected by vhost_scsi->dev.mutex */
 	struct vhost_scsi_tpg **vs_tpg;
+	struct list_head vhost_scsi_list_entry;
 	char vs_vhost_wwpn[TRANSPORT_IQN_LEN];
+	struct vhost_scsi_md *md;
+	bool md_enabled;
 
 	struct vhost_dev dev;
-	struct vhost_scsi_virtqueue vqs[VHOST_SCSI_MAX_VQ];
+	struct vhost_scsi_virtqueue *vqs;
 
 	struct vhost_work vs_completion_work; /* cmd completion work item */
 	struct llist_head vs_completion_list; /* cmd completion queue */
@@ -242,8 +257,11 @@ struct vhost_scsi_ctx {
 static struct workqueue_struct *vhost_scsi_workqueue;
 
 /* Global spinlock to protect vhost_scsi TPG list for vhost IOCTL access */
-static DEFINE_MUTEX(vhost_scsi_mutex);
-static LIST_HEAD(vhost_scsi_list);
+static DEFINE_MUTEX(vhost_scsi_tpg_mutex);
+static LIST_HEAD(vhost_scsi_tpg_list);
+
+/* List of multiple device (mq) devs accesed under the vhost_scsi_tpg_mutex */
+static LIST_HEAD(vhost_scsi_md_list);
 
 static void vhost_scsi_done_inflight(struct kref *kref)
 {
@@ -260,7 +278,7 @@ static void vhost_scsi_init_inflight(struct vhost_scsi *vs,
 	struct vhost_virtqueue *vq;
 	int idx, i;
 
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
+	for (i = 0; i < vs->dev.max_nvqs; i++) {
 		vq = &vs->vqs[i].vq;
 
 		mutex_lock(&vq->mutex);
@@ -588,8 +606,7 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 	}
 
 	vq = -1;
-	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
-		< VHOST_SCSI_MAX_VQ)
+	while ((vq = find_next_bit(signal, vs->dev.nvqs, vq + 1)) < vs->dev.nvqs)
 		vhost_signal(&vs->dev, &vs->vqs[vq].vq);
 }
 
@@ -1443,6 +1460,11 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
 	vhost_scsi_handle_vq(vs, vq);
 }
 
+static void vhost_scsi_no_op_kick(struct vhost_work *work)
+{
+	pr_err("Invalid no op kick call\n");
+}
+
 /* Callers must hold dev mutex */
 static void vhost_scsi_flush(struct vhost_scsi *vs)
 {
@@ -1457,14 +1479,14 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
 	 * indicate the start of the flush operation so that it will reach 0
 	 * when all the reqs are finished.
 	 */
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
+	for (i = 0; i < vs->dev.nvqs; i++)
 		kref_put(&old_inflight[i]->kref, vhost_scsi_done_inflight);
 
 	/* Flush both the vhost poll and vhost work */
 	vhost_work_dev_flush(&vs->dev);
 
 	/* Wait for all reqs issued before the flush to be finished */
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++)
+	for (i = 0; i < vs->dev.nvqs; i++)
 		wait_for_completion(&old_inflight[i]->comp);
 }
 
@@ -1545,12 +1567,87 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	return -ENOMEM;
 }
 
+static void vhost_scsi_md_del_vs(struct vhost_scsi *vs)
+{
+	struct vhost_scsi_md *md;
+
+	if (!vs->md_enabled)
+		return;
+
+	if (list_empty(&vs->vhost_scsi_list_entry))
+		return;
+
+	md = vs->md;
+	vs->md = NULL;
+	md->vs_cnt--;
+	list_del_init(&vs->vhost_scsi_list_entry);
+
+	if (!md->vs_cnt) {
+		list_del(&md->vhost_scsi_md_list_entry);
+		kfree(md);
+	}
+}
+
+static int vhost_scsi_md_add_vs(struct vhost_scsi *vs,
+				struct vhost_scsi_target *tgt)
+{
+	struct vhost_scsi *lead_vs;
+	struct vhost_scsi_md *md;
+
+	if (!vs->md_enabled)
+		return 0;
+
+	if (!list_empty(&vs->vhost_scsi_list_entry))
+		return 0;
+
+	list_for_each_entry(md, &vhost_scsi_md_list, vhost_scsi_md_list_entry) {
+		lead_vs = list_first_entry(&md->vhost_scsi_list,
+					   struct vhost_scsi,
+					   vhost_scsi_list_entry);
+		if (memcmp(lead_vs->vs_vhost_wwpn, tgt->vhost_wwpn,
+			   sizeof(tgt->vhost_wwpn)))
+			continue;
+
+		goto add_vs;
+	}
+
+	md = kzalloc(sizeof(*md), GFP_KERNEL);
+	if (md)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&md->vhost_scsi_list);
+	INIT_LIST_HEAD(&md->vhost_scsi_md_list_entry);
+
+	list_add_tail(&md->vhost_scsi_md_list_entry, &vhost_scsi_md_list);
+
+add_vs:
+	switch (md->vs_cnt) {
+	case VHOST_SCSI_VQ_CTL:
+		vhost_vq_reset_kick_handler(&vs->vqs[0].vq,
+					    vhost_scsi_ctl_handle_kick);
+		break;
+	case VHOST_SCSI_VQ_EVT:
+		vhost_vq_reset_kick_handler(&vs->vqs[0].vq,
+					    vhost_scsi_evt_handle_kick);
+		break;
+	default:
+		vhost_vq_reset_kick_handler(&vs->vqs[0].vq,
+					    vhost_scsi_handle_kick);
+		break;
+	}
+
+	vs->md = md;
+	md->vs_cnt++;
+	list_add_tail(&vs->vhost_scsi_list_entry, &md->vhost_scsi_list);
+	return 0;
+}
+
 /*
  * Called from vhost_scsi_ioctl() context to walk the list of available
  * vhost_scsi_tpg with an active struct vhost_scsi_nexus
  *
  *  The lock nesting rule is:
- *    vhost_scsi_mutex -> vs->dev.mutex -> tpg->tv_tpg_mutex -> vq->mutex
+ *    vhost_scsi_tpg_mutex -> vs->dev.mutex -> tpg->tv_tpg_mutex -> vq->mutex
  */
 static int
 vhost_scsi_set_endpoint(struct vhost_scsi *vs,
@@ -1564,7 +1661,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	int index, ret, i, len;
 	bool match = false;
 
-	mutex_lock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
 	mutex_lock(&vs->dev.mutex);
 
 	/* Verify that ring has been setup correctly. */
@@ -1585,13 +1682,14 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	if (vs->vs_tpg)
 		memcpy(vs_tpg, vs->vs_tpg, len);
 
-	list_for_each_entry(tpg, &vhost_scsi_list, tv_tpg_list) {
+	list_for_each_entry(tpg, &vhost_scsi_tpg_list, tv_tpg_list) {
 		mutex_lock(&tpg->tv_tpg_mutex);
 		if (!tpg->tpg_nexus) {
 			mutex_unlock(&tpg->tv_tpg_mutex);
 			continue;
 		}
-		if (tpg->tv_tpg_vhost_count != 0) {
+		if (!vhost_scsi_multi_dev_per_nexus &&
+		    tpg->tv_tpg_vhost_count != 0) {
 			mutex_unlock(&tpg->tv_tpg_mutex);
 			continue;
 		}
@@ -1616,8 +1714,19 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 				mutex_unlock(&tpg->tv_tpg_mutex);
 				goto undepend;
 			}
+
+			ret = vhost_scsi_md_add_vs(vs, t);
+			if (ret)
+				goto undepend;
+
+			/*
+			 * In md mode the first vs added will be used for the
+			 * event queue. In non-md mode we only have the 1 vs.
+			 */
+			if (!tpg->vhost_scsi)
+				tpg->vhost_scsi = vs;
+
 			tpg->tv_tpg_vhost_count++;
-			tpg->vhost_scsi = vs;
 			vs_tpg[tpg->tport_tpgt] = tpg;
 			match = true;
 		}
@@ -1628,7 +1737,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 		memcpy(vs->vs_vhost_wwpn, t->vhost_wwpn,
 		       sizeof(vs->vs_vhost_wwpn));
 
-		for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
+		for (i = VHOST_SCSI_VQ_IO; i < vs->dev.nvqs; i++) {
 			vq = &vs->vqs[i].vq;
 			if (!vq->initialized)
 				continue;
@@ -1637,7 +1746,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 				goto destroy_vq_cmds;
 		}
 
-		for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
+		for (i = 0; i < vs->dev.nvqs; i++) {
 			vq = &vs->vqs[i].vq;
 			if (!vq->initialized)
 				continue;
@@ -1670,6 +1779,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	for (i = 0; i < VHOST_SCSI_MAX_TARGET; i++) {
 		tpg = vs_tpg[i];
 		if (tpg) {
+			vhost_scsi_md_del_vs(vs);
 			tpg->tv_tpg_vhost_count--;
 			target_undepend_item(&tpg->se_tpg.tpg_group.cg_item);
 		}
@@ -1677,7 +1787,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	kfree(vs_tpg);
 out:
 	mutex_unlock(&vs->dev.mutex);
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 	return ret;
 }
 
@@ -1693,7 +1803,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	int index, ret, i;
 	u8 target;
 
-	mutex_lock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
 	mutex_lock(&vs->dev.mutex);
 	/* Verify that ring has been setup correctly. */
 	for (index = 0; index < vs->dev.max_nvqs; ++index) {
@@ -1732,6 +1842,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 		tpg->tv_tpg_vhost_count--;
 		tpg->vhost_scsi = NULL;
 		vs->vs_tpg[target] = NULL;
+		vhost_scsi_md_del_vs(vs);
 		match = true;
 		mutex_unlock(&tpg->tv_tpg_mutex);
 		/*
@@ -1742,7 +1853,7 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 		target_undepend_item(&se_tpg->tpg_group.cg_item);
 	}
 	if (match) {
-		for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
+		for (i = 0; i < vs->dev.nvqs; i++) {
 			vq = &vs->vqs[i].vq;
 			if (!vq->initialized)
 				continue;
@@ -1767,14 +1878,14 @@ static int vhost_scsi_setup_vq_cmds(struct vhost_virtqueue *vq, int max_cmds)
 	vs->vs_tpg = NULL;
 	WARN_ON(vs->vs_events_nr);
 	mutex_unlock(&vs->dev.mutex);
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 	return 0;
 
 err_tpg:
 	mutex_unlock(&tpg->tv_tpg_mutex);
 err_dev:
 	mutex_unlock(&vs->dev.mutex);
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 	return ret;
 }
 
@@ -1793,7 +1904,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 		return -EFAULT;
 	}
 
-	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
+	for (i = 0; i < vs->dev.nvqs; i++) {
 		vq = &vs->vqs[i].vq;
 		mutex_lock(&vq->mutex);
 		vq->acked_features = features;
@@ -1803,11 +1914,48 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features)
 	return 0;
 }
 
+static struct vhost_virtqueue **
+vhost_scsi_vqs_init(struct vhost_scsi *vs, int max_nvqs)
+{
+	struct vhost_virtqueue **vqs;
+	int i;
+
+	vs->vqs = kcalloc(max_nvqs, sizeof(*vs->vqs), GFP_KERNEL);
+	if (!vs->vqs)
+		return NULL;
+
+	vqs = kcalloc(max_nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs)
+		goto err_vqs;
+
+	if (!vs->md_enabled) {
+		vqs[VHOST_SCSI_VQ_CTL] = &vs->vqs[VHOST_SCSI_VQ_CTL].vq;
+		vqs[VHOST_SCSI_VQ_EVT] = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
+		vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick =
+						vhost_scsi_ctl_handle_kick;
+		vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick =
+						vhost_scsi_evt_handle_kick;
+		for (i = VHOST_SCSI_VQ_IO; i < max_nvqs; i++) {
+			vqs[i] = &vs->vqs[i].vq;
+			vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
+		}
+	} else {
+		vqs[0] = &vs->vqs[0].vq;
+		vs->vqs[0].vq.handle_kick = vhost_scsi_no_op_kick;
+	}
+
+	return vqs;
+
+err_vqs:
+	kfree(vs->vqs);
+	return NULL;
+}
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi *vs;
 	struct vhost_virtqueue **vqs;
-	int r = -ENOMEM, i;
+	int r = -ENOMEM, nvqs, max_nvqs;
 
 	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!vs) {
@@ -1815,10 +1963,20 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 		if (!vs)
 			goto err_vs;
 	}
+	INIT_LIST_HEAD(&vs->vhost_scsi_list_entry);
+	vs->md_enabled = vhost_scsi_multi_dev_per_nexus;
 
-	vqs = kmalloc_array(VHOST_SCSI_MAX_VQ, sizeof(*vqs), GFP_KERNEL);
-	if (!vqs)
-		goto err_vqs;
+	if (vs->md_enabled) {
+		max_nvqs = 1;
+		nvqs = 1;
+	} else {
+		/*
+		 * We will always need the ctl, evt and at least 1 IO vq.
+		 * Create more IO vqs if userspace requests them.
+		 */
+		max_nvqs = VHOST_SCSI_MAX_VQ;
+		nvqs = 3;
+	}
 
 	vhost_work_init(&vs->vs_completion_work, vhost_scsi_complete_cmd_work);
 	vhost_work_init(&vs->vs_event_work, vhost_scsi_evt_work);
@@ -1826,20 +1984,11 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	vs->vs_events_nr = 0;
 	vs->vs_events_missed = false;
 
-	vqs[VHOST_SCSI_VQ_CTL] = &vs->vqs[VHOST_SCSI_VQ_CTL].vq;
-	vqs[VHOST_SCSI_VQ_EVT] = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
-	vs->vqs[VHOST_SCSI_VQ_CTL].vq.handle_kick = vhost_scsi_ctl_handle_kick;
-	vs->vqs[VHOST_SCSI_VQ_EVT].vq.handle_kick = vhost_scsi_evt_handle_kick;
-	for (i = VHOST_SCSI_VQ_IO; i < VHOST_SCSI_MAX_VQ; i++) {
-		vqs[i] = &vs->vqs[i].vq;
-		vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
-	}
+	vqs = vhost_scsi_vqs_init(vs, max_nvqs);
+	if (!vqs)
+		goto err_vqs_init;
 
-	/*
-	 * We will always need the ctl, evt and at least 1 IO vq. Create more
-	 * IO vqs if userspace requests them.
-	 */
-	r = vhost_dev_init(&vs->dev, vqs, 3, VHOST_SCSI_MAX_VQ, UIO_MAXIOV,
+	r = vhost_dev_init(&vs->dev, vqs, nvqs, max_nvqs, UIO_MAXIOV,
 			   VHOST_SCSI_WEIGHT, 0, true, NULL);
 	if (r)
 		goto err_dev_init;
@@ -1851,7 +2000,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 
 err_dev_init:
 	kfree(vqs);
-err_vqs:
+	kfree(vs->vqs);
+err_vqs_init:
 	kvfree(vs);
 err_vs:
 	return r;
@@ -1871,6 +2021,7 @@ static int vhost_scsi_release(struct inode *inode, struct file *f)
 	/* Jobs can re-queue themselves in evt kick handler. Do extra flush. */
 	vhost_scsi_flush(vs);
 	kfree(vs->dev.vqs);
+	kfree(vs->vqs);
 	kvfree(vs);
 	return 0;
 }
@@ -2035,7 +2186,7 @@ static int vhost_scsi_port_link(struct se_portal_group *se_tpg,
 	INIT_LIST_HEAD(&tmf->queue_entry);
 	vhost_work_init(&tmf->vwork, vhost_scsi_tmf_resp_work);
 
-	mutex_lock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
 
 	mutex_lock(&tpg->tv_tpg_mutex);
 	tpg->tv_tpg_port_count++;
@@ -2044,7 +2195,7 @@ static int vhost_scsi_port_link(struct se_portal_group *se_tpg,
 
 	vhost_scsi_hotplug(tpg, lun);
 
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 
 	return 0;
 }
@@ -2056,7 +2207,7 @@ static void vhost_scsi_port_unlink(struct se_portal_group *se_tpg,
 				struct vhost_scsi_tpg, se_tpg);
 	struct vhost_scsi_tmf *tmf;
 
-	mutex_lock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
 
 	mutex_lock(&tpg->tv_tpg_mutex);
 	tpg->tv_tpg_port_count--;
@@ -2068,7 +2219,7 @@ static void vhost_scsi_port_unlink(struct se_portal_group *se_tpg,
 
 	vhost_scsi_hotunplug(tpg, lun);
 
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 }
 
 static ssize_t vhost_scsi_tpg_attrib_fabric_prot_type_store(
@@ -2333,9 +2484,9 @@ static ssize_t vhost_scsi_tpg_nexus_store(struct config_item *item,
 		kfree(tpg);
 		return NULL;
 	}
-	mutex_lock(&vhost_scsi_mutex);
-	list_add_tail(&tpg->tv_tpg_list, &vhost_scsi_list);
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
+	list_add_tail(&tpg->tv_tpg_list, &vhost_scsi_tpg_list);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 
 	return &tpg->se_tpg;
 }
@@ -2345,9 +2496,9 @@ static void vhost_scsi_drop_tpg(struct se_portal_group *se_tpg)
 	struct vhost_scsi_tpg *tpg = container_of(se_tpg,
 				struct vhost_scsi_tpg, se_tpg);
 
-	mutex_lock(&vhost_scsi_mutex);
+	mutex_lock(&vhost_scsi_tpg_mutex);
 	list_del(&tpg->tv_tpg_list);
-	mutex_unlock(&vhost_scsi_mutex);
+	mutex_unlock(&vhost_scsi_tpg_mutex);
 	/*
 	 * Release the virtual I_T Nexus for this vhost TPG
 	 */
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ca2e71..d44351c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -199,6 +199,25 @@ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 }
 EXPORT_SYMBOL_GPL(vhost_poll_init);
 
+static void vhost_poll_reinit_work(struct vhost_poll *poll, vhost_work_fn_t fn)
+{
+	vhost_work_init(&poll->work, fn);
+}
+
+/**
+ * vhost_vq_reset_kick_handler: reset the vq's kick handler
+ * @vq: vq to reset
+ * @fn: new kick function
+ *
+ * This must be called before the vq is activated
+ */
+void vhost_vq_reset_kick_handler(struct vhost_virtqueue *vq, vhost_work_fn_t fn)
+{
+	vq->handle_kick = fn;
+	vhost_poll_reinit_work(&vq->poll, fn);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_reset_kick_handler);
+
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
  * keep a reference to a file until after vhost_poll_stop is called. */
 int vhost_poll_start(struct vhost_poll *poll, struct file *file)
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 08c5aef..d748cb7 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -193,6 +193,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+void vhost_vq_reset_kick_handler(struct vhost_virtqueue *vq, vhost_work_fn_t fn);
 int vhost_vq_init_access(struct vhost_virtqueue *);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
 int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,

Jason Wang Nov. 2, 2020, 6:36 a.m. UTC | #8

On 2020/10/30 下午4:47, Michael S. Tsirkin wrote:
> On Tue, Oct 27, 2020 at 12:47:34AM -0500, Mike Christie wrote:
>> On 10/25/20 10:51 PM, Jason Wang wrote:
>>> On 2020/10/22 上午8:34, Mike Christie wrote:
>>>> Each vhost-scsi device will need a evt and ctl queue, but the number
>>>> of IO queues depends on whatever the user has configured in userspace.
>>>> This patch has vhost-scsi create the evt, ctl and one IO vq at device
>>>> open time. We then create the other IO vqs when userspace starts to
>>>> set them up. We still waste some mem on the vq and scsi vq structs,
>>>> but we don't waste mem on iovec related arrays and for later patches
>>>> we know which queues are used by the dev->nvqs value.
>>>>
>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>>>> ---
>>>>    drivers/vhost/scsi.c | 19 +++++++++++++++----
>>>>    1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>> Not familiar with SCSI. But I wonder if it could behave like vhost-net.
>>>
>>> E.g userspace should known the number of virtqueues so it can just open
>>> and close multiple vhost-scsi file descriptors.
>>>
>> One hiccup I'm hitting is that we might end up creating about 3x more vqs
>> than we need. The problem is that for scsi each vhost device has:
>>
>> vq=0: special control vq
>> vq=1: event vq
>> vq=2 and above: SCSI CMD/IO vqs. We want to create N of these.
>>
>> Today we do:
>>
>> Uerspace does open(/dev/vhost-scsi)
>>          vhost_dev_init(create 128 vqs and then later we setup and use N of
>> them);
>>
>> Qemu does ioctl(VHOST_SET_OWNER)
>>          vhost_dev_set_owner()
>>
>> For N vqs userspace does:
>>          // virtqueue setup related ioctls
>>
>> Qemu does ioctl(VHOST_SCSI_SET_ENDPOINT)
>>          - match LIO/target port to vhost_dev
>>
>>
>> So we could change that to:
>>
>> For N IO vqs userspace does
>>          open(/dev/vhost-scsi)
>>                  vhost_dev_init(create IO, evt, and ctl);
>>
>> for N IO vqs Qemu does:
>>          ioctl(VHOST_SET_OWNER)
>>                  vhost_dev_set_owner()
>>
>> for N IO vqs Qemu does:
>>          // virtqueue setup related ioctls
>>
>> for N IO vqs Qemu does:
>>          ioctl(VHOST_SCSI_SET_ENDPOINT)
>>                  - match LIO/target port to vhost_dev and assemble the
>> multiple vhost_dev device.
>>
>> The problem is that we have to setup some of the evt/ctl specific parts at
>> open() time when vhost_dev_init does vhost_poll_init for example.
>>
>> - At open time, we don't know if this vhost_dev is going to be part of a
>> multiple vhost_device device or a single one so we need to create at least 3
>> of them
>> - If it is a multiple device we don't know if its the first device being
>> created for the device or the N'th, so we don't know if the dev's vqs will
>> be used for IO or ctls/evts, so we have to create all 3.
>>
>> When we get the first VHOST_SCSI_SET_ENDPOINT call for a new style multiple
>> vhost_dev device, we can use that dev's evt/ctl vqs for events/controls
>> requests. When we get the other VHOST_SCSI_SET_ENDPOINT calls for the
>> multiple vhost_dev device then those dev's evt/ctl vqs will be ignored and
>> we will only use their IO vqs. So we end up with a lot of extra vqs.
> The issue Jason's hinting at is how can admins control the amount
> of resources a given qemu instance can consume?
> After all vhost vqs all live in host kernel memory ...
> Limiting # of open fds would be one way to do that ...
>
> The need to share event/control vqs between devices is a problem though,
> and sending lots of ioctls on things like reset is also not that elegant.
> Jason, did you have a good solution in mind?


Nope, I'm not familiar with SCSI so I don't even know sharing evt/cvq is 
possible. Consider VHOST_SCSI_MAX_VQ is already 128 per device. Mike's 
proposal seems to be better.

Thanks


>
>> One other question/issue I have is that qemu can open the /dev/vhost-scsi
>> device or it allows tools like libvirtd to open the device and pass in the
>> fd to use. For the latter case, would we continue to have those tools pass
>> in the leading fd, then have qemu do the other num_queues - 1
>> open(/dev/vhost-scsi) calls? Or do these apps that pass in the fd need to
>> know about all of the fds for some management reason?
> They know about all the fds, for resource control and priveledge
> separation reasons.
>

Jason Wang Nov. 2, 2020, 6:49 a.m. UTC | #9

On 2020/11/2 下午2:36, Jason Wang wrote:
>>
>> The need to share event/control vqs between devices is a problem though,
>> and sending lots of ioctls on things like reset is also not that 
>> elegant.
>> Jason, did you have a good solution in mind?
>
>
> Nope, I'm not familiar with SCSI so I don't even know sharing evt/cvq 
> is possible. Consider VHOST_SCSI_MAX_VQ is already 128 per device. 
> Mike's proposal seems to be better.
>
> Thanks 


Btw, it looks to me vhost_scsi_do_evt_work() has the assumption of iovec 
layout which needs to be fixed.

Thanks

Mike Christie Nov. 2, 2020, 4:19 p.m. UTC | #10

On 11/2/20 12:49 AM, Jason Wang wrote:
> 
> On 2020/11/2 下午2:36, Jason Wang wrote:
>>>
>>> The need to share event/control vqs between devices is a problem though,
>>> and sending lots of ioctls on things like reset is also not that 
>>> elegant.
>>> Jason, did you have a good solution in mind?
>>
>>
>> Nope, I'm not familiar with SCSI so I don't even know sharing evt/cvq 
>> is possible. Consider VHOST_SCSI_MAX_VQ is already 128 per device. 
>> Mike's proposal seems to be better.

Hey, which proposal are you saying was best?

1. Add on to the current scsi mq design where we are doing a single 
device and multiple vqs already. So basically just fix what we have and 
add in patches 12 - 16 to do a thread per VQ?

2. The proposal I stated to hack up over the weekend to try and support 
the current design and then add in support for your multiple device 
single vq design:

http://archive.lwn.net:8080/linux-scsi/292879d9-915d-8587-0678-8677a800c613@oracle.com/

>>
>> Thanks 
> 
> 
> Btw, it looks to me vhost_scsi_do_evt_work() has the assumption of iovec 
> layout which needs to be fixed.

I wanted to be clear, because I thought you meant #1, but this comment 
seems like it would only be for #2.

[07/17] vhost scsi: support delayed IO vq creation

Commit Message

Comments

Patch