mbox series

[v2,0/7] vhost-scsi: Fix crashes and management op hangs

Message ID 20230321020624.13323-1-michael.christie@oracle.com (mailing list archive)
Headers show
Series vhost-scsi: Fix crashes and management op hangs | expand

Message

Mike Christie March 21, 2023, 2:06 a.m. UTC
The following patches were made over Linus tree. The patches fix 3
issues:

1. If a user performs LIO LUN unmapping before the endpoint has been
cleared then we can end up trying to free a bogus tmf struct if the
TMF is still exucuting when we do the unmap.
2. If vhost_scsi_setup_vq_cmds fails we can leave the tpg->vhost_scsi
pointer set and we can end up trying to access a freed struct.
3. Management operations like LUN mapping/unmapping and device addition
hang for 30 seconds or up to N minutes depending on the device.

The problem is that we use a global mutex to protect the list of tpgs
and for accessing the tpg, and to make sure they are flushed. We then
hold that mutex during a lot of management operations. So if you
are just trying to add another device, it will have to wait on another
device if we are in the middle of clearing it's endpoint and it's
waiting on hung IO.

This patchset fixes up the ordering of how we flush IO and release
refcounts and how often the global mutex is used so we don't need to
always hold it

v2:
1. Added fix for possible use after free and merge with a locking
cleanup patch.
2. Added fix for LIO LUN unmap during TMF execution bug.
3. Fixed bug where we needed to hold the tpg mutex instead of the
vhost_scsi_mutex when calling vhost_scsi_do_plug.

Comments

Mike Christie March 21, 2023, 2:29 a.m. UTC | #1
On 3/20/23 9:06 PM, Mike Christie wrote:
> The following patches were made over Linus tree.

Hi Michael, I see you merged my first version of the patchset in your
vhost branch.

Do you want me to just send a followup patchset?

The major diff between the 2 versions:

1. I added the first 2 patches which fix some bugs in the existing code
I found while doing some code review and testing another LIO patchset
plus v1.

Note: The other day I posted that I thought the 3rd patch in v1 caused
the bugs but they were already in the code.

2. In v2 I made one of the patches not need the vhost device lock when
unmapping/mapping LUNs, so you can add new LUNs even if one LUN on the same
vhost_scsi device was hung.

Since it's not regressions with the existing patches, I can just send those
as a followup patchset if that's preferred.
Michael S. Tsirkin March 21, 2023, 7:12 a.m. UTC | #2
On Mon, Mar 20, 2023 at 09:29:50PM -0500, michael.christie@oracle.com wrote:
> On 3/20/23 9:06 PM, Mike Christie wrote:
> > The following patches were made over Linus tree.
> 
> Hi Michael, I see you merged my first version of the patchset in your
> vhost branch.
> 
> Do you want me to just send a followup patchset?
> 
> The major diff between the 2 versions:
> 
> 1. I added the first 2 patches which fix some bugs in the existing code
> I found while doing some code review and testing another LIO patchset
> plus v1.
> 
> Note: The other day I posted that I thought the 3rd patch in v1 caused
> the bugs but they were already in the code.
> 
> 2. In v2 I made one of the patches not need the vhost device lock when
> unmapping/mapping LUNs, so you can add new LUNs even if one LUN on the same
> vhost_scsi device was hung.
> 
> Since it's not regressions with the existing patches, I can just send those
> as a followup patchset if that's preferred.

It's ok, I will drop v1 and replace it with v2.
Do you feel any of this is needed in this release?
Mike Christie March 21, 2023, 5:25 p.m. UTC | #3
On 3/21/23 2:12 AM, Michael S. Tsirkin wrote:
> On Mon, Mar 20, 2023 at 09:29:50PM -0500, michael.christie@oracle.com wrote:
>> On 3/20/23 9:06 PM, Mike Christie wrote:
>>> The following patches were made over Linus tree.
>>
>> Hi Michael, I see you merged my first version of the patchset in your
>> vhost branch.
>>
>> Do you want me to just send a followup patchset?
>>
>> The major diff between the 2 versions:
>>
>> 1. I added the first 2 patches which fix some bugs in the existing code
>> I found while doing some code review and testing another LIO patchset
>> plus v1.
>>
>> Note: The other day I posted that I thought the 3rd patch in v1 caused
>> the bugs but they were already in the code.
>>
>> 2. In v2 I made one of the patches not need the vhost device lock when
>> unmapping/mapping LUNs, so you can add new LUNs even if one LUN on the same
>> vhost_scsi device was hung.
>>
>> Since it's not regressions with the existing patches, I can just send those
>> as a followup patchset if that's preferred.
> 
> It's ok, I will drop v1 and replace it with v2.
> Do you feel any of this is needed in this release?
> 

No. It can wait for 6.4. Thanks.