mbox series

[V1,vfio,0/5] Improve mlx5 driver to better handle some error cases

Message ID 20240205124828.232701-1-yishaih@nvidia.com (mailing list archive)
Headers show
Series Improve mlx5 driver to better handle some error cases | expand

Message

Yishai Hadas Feb. 5, 2024, 12:48 p.m. UTC
This series improves the mlx5 driver to better handle some error cases
as of below.

The first two patches let the driver recognize whether the firmware
moved the tracker object to an error state. In that case, the driver
will skip/block any usage of that object.

The next two patches (#3, #4), improve the driver to better include the
proper firmware syndrome in dmesg upon a failure in some firmware
commands.

The last patch follows the device specification to let the firmware know
upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
migration cancellation, etc.).

This will let the firmware clean its internal resources that were turned
on upon PRE_COPY.

Note:
As the first patch should go to net/mlx5, we may need to send it as a
pull request format to vfio before acceptance of the series, to avoid
conflicts.

Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-yishaih@nvidia.com/
Patch #2:
- Rename to use 'object changed' in some places to make it clearer.
- Enhance the commit log to better clarify the usage/use case.

The above was suggested by Tian, Kevin <kevin.tian@intel.com>.

Yishai

Yishai Hadas (5):
  net/mlx5: Add the IFC related bits for query tracker
  vfio/mlx5: Add support for tracker object change event
  vfio/mlx5: Handle the EREMOTEIO error upon the SAVE command
  vfio/mlx5: Block incremental query upon migf state error
  vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNING

 drivers/vfio/pci/mlx5/cmd.c   | 74 ++++++++++++++++++++++++++++++++---
 drivers/vfio/pci/mlx5/cmd.h   |  5 ++-
 drivers/vfio/pci/mlx5/main.c  | 39 ++++++++++++++----
 include/linux/mlx5/mlx5_ifc.h |  5 +++
 4 files changed, 110 insertions(+), 13 deletions(-)

Comments

Tian, Kevin Feb. 6, 2024, 7:35 a.m. UTC | #1
> From: Yishai Hadas <yishaih@nvidia.com>
> Sent: Monday, February 5, 2024 8:48 PM
> 
> This series improves the mlx5 driver to better handle some error cases
> as of below.
> 
> The first two patches let the driver recognize whether the firmware
> moved the tracker object to an error state. In that case, the driver
> will skip/block any usage of that object.
> 
> The next two patches (#3, #4), improve the driver to better include the
> proper firmware syndrome in dmesg upon a failure in some firmware
> commands.
> 
> The last patch follows the device specification to let the firmware know
> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
> migration cancellation, etc.).
> 
> This will let the firmware clean its internal resources that were turned
> on upon PRE_COPY.
> 
> Note:
> As the first patch should go to net/mlx5, we may need to send it as a
> pull request format to vfio before acceptance of the series, to avoid
> conflicts.
> 
> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
> yishaih@nvidia.com/
> Patch #2:
> - Rename to use 'object changed' in some places to make it clearer.
> - Enhance the commit log to better clarify the usage/use case.
> 
> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
> 

this series looks good to me except a small remark on patch2:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Yishai Hadas Feb. 6, 2024, 8:06 a.m. UTC | #2
On 06/02/2024 9:35, Tian, Kevin wrote:
>> From: Yishai Hadas <yishaih@nvidia.com>
>> Sent: Monday, February 5, 2024 8:48 PM
>>
>> This series improves the mlx5 driver to better handle some error cases
>> as of below.
>>
>> The first two patches let the driver recognize whether the firmware
>> moved the tracker object to an error state. In that case, the driver
>> will skip/block any usage of that object.
>>
>> The next two patches (#3, #4), improve the driver to better include the
>> proper firmware syndrome in dmesg upon a failure in some firmware
>> commands.
>>
>> The last patch follows the device specification to let the firmware know
>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
>> migration cancellation, etc.).
>>
>> This will let the firmware clean its internal resources that were turned
>> on upon PRE_COPY.
>>
>> Note:
>> As the first patch should go to net/mlx5, we may need to send it as a
>> pull request format to vfio before acceptance of the series, to avoid
>> conflicts.
>>
>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
>> yishaih@nvidia.com/
>> Patch #2:
>> - Rename to use 'object changed' in some places to make it clearer.
>> - Enhance the commit log to better clarify the usage/use case.
>>
>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
>>
> 
> this series looks good to me except a small remark on patch2:

We should be fine there, see my answer on V0.

> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>

Thanks Kevin, for your reviewed-by.

Yishai
Yishai Hadas Feb. 8, 2024, 8:16 a.m. UTC | #3
On 06/02/2024 10:06, Yishai Hadas wrote:
> On 06/02/2024 9:35, Tian, Kevin wrote:
>>> From: Yishai Hadas <yishaih@nvidia.com>
>>> Sent: Monday, February 5, 2024 8:48 PM
>>>
>>> This series improves the mlx5 driver to better handle some error cases
>>> as of below.
>>>
>>> The first two patches let the driver recognize whether the firmware
>>> moved the tracker object to an error state. In that case, the driver
>>> will skip/block any usage of that object.
>>>
>>> The next two patches (#3, #4), improve the driver to better include the
>>> proper firmware syndrome in dmesg upon a failure in some firmware
>>> commands.
>>>
>>> The last patch follows the device specification to let the firmware know
>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
>>> migration cancellation, etc.).
>>>
>>> This will let the firmware clean its internal resources that were turned
>>> on upon PRE_COPY.
>>>
>>> Note:
>>> As the first patch should go to net/mlx5, we may need to send it as a
>>> pull request format to vfio before acceptance of the series, to avoid
>>> conflicts.
>>>
>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
>>> yishaih@nvidia.com/
>>> Patch #2:
>>> - Rename to use 'object changed' in some places to make it clearer.
>>> - Enhance the commit log to better clarify the usage/use case.
>>>
>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
>>>
>>
>> this series looks good to me except a small remark on patch2:
> 
> We should be fine there, see my answer on V0.
> 
>>
>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> Thanks Kevin, for your reviewed-by.
> 
> Yishai
> 

Alex

Are we OK here to continue with a PR for the first patch ?

It seems that we should be fine here.

Thanks,
Yishai
Yishai Hadas Feb. 21, 2024, 7:45 a.m. UTC | #4
On 08/02/2024 10:16, Yishai Hadas wrote:
> On 06/02/2024 10:06, Yishai Hadas wrote:
>> On 06/02/2024 9:35, Tian, Kevin wrote:
>>>> From: Yishai Hadas <yishaih@nvidia.com>
>>>> Sent: Monday, February 5, 2024 8:48 PM
>>>>
>>>> This series improves the mlx5 driver to better handle some error cases
>>>> as of below.
>>>>
>>>> The first two patches let the driver recognize whether the firmware
>>>> moved the tracker object to an error state. In that case, the driver
>>>> will skip/block any usage of that object.
>>>>
>>>> The next two patches (#3, #4), improve the driver to better include the
>>>> proper firmware syndrome in dmesg upon a failure in some firmware
>>>> commands.
>>>>
>>>> The last patch follows the device specification to let the firmware 
>>>> know
>>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
>>>> migration cancellation, etc.).
>>>>
>>>> This will let the firmware clean its internal resources that were 
>>>> turned
>>>> on upon PRE_COPY.
>>>>
>>>> Note:
>>>> As the first patch should go to net/mlx5, we may need to send it as a
>>>> pull request format to vfio before acceptance of the series, to avoid
>>>> conflicts.
>>>>
>>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
>>>> yishaih@nvidia.com/
>>>> Patch #2:
>>>> - Rename to use 'object changed' in some places to make it clearer.
>>>> - Enhance the commit log to better clarify the usage/use case.
>>>>
>>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
>>>>
>>>
>>> this series looks good to me except a small remark on patch2:
>>
>> We should be fine there, see my answer on V0.
>>
>>>
>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>
>> Thanks Kevin, for your reviewed-by.
>>
>> Yishai
>>
> 
> Alex
> 
> Are we OK here to continue with a PR for the first patch ?
> 
> It seems that we should be fine here.
> 
> Thanks,
> Yishai
> 

Hi Alex,
Any update here ?

Thanks,
Yishai
Alex Williamson Feb. 22, 2024, 6:04 p.m. UTC | #5
On Wed, 21 Feb 2024 09:45:14 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> On 08/02/2024 10:16, Yishai Hadas wrote:
> > On 06/02/2024 10:06, Yishai Hadas wrote:  
> >> On 06/02/2024 9:35, Tian, Kevin wrote:  
> >>>> From: Yishai Hadas <yishaih@nvidia.com>
> >>>> Sent: Monday, February 5, 2024 8:48 PM
> >>>>
> >>>> This series improves the mlx5 driver to better handle some error cases
> >>>> as of below.
> >>>>
> >>>> The first two patches let the driver recognize whether the firmware
> >>>> moved the tracker object to an error state. In that case, the driver
> >>>> will skip/block any usage of that object.
> >>>>
> >>>> The next two patches (#3, #4), improve the driver to better include the
> >>>> proper firmware syndrome in dmesg upon a failure in some firmware
> >>>> commands.
> >>>>
> >>>> The last patch follows the device specification to let the firmware 
> >>>> know
> >>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
> >>>> migration cancellation, etc.).
> >>>>
> >>>> This will let the firmware clean its internal resources that were 
> >>>> turned
> >>>> on upon PRE_COPY.
> >>>>
> >>>> Note:
> >>>> As the first patch should go to net/mlx5, we may need to send it as a
> >>>> pull request format to vfio before acceptance of the series, to avoid
> >>>> conflicts.
> >>>>
> >>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
> >>>> yishaih@nvidia.com/
> >>>> Patch #2:
> >>>> - Rename to use 'object changed' in some places to make it clearer.
> >>>> - Enhance the commit log to better clarify the usage/use case.
> >>>>
> >>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
> >>>>  
> >>>
> >>> this series looks good to me except a small remark on patch2:  
> >>
> >> We should be fine there, see my answer on V0.
> >>  
> >>>
> >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>  
> >>
> >> Thanks Kevin, for your reviewed-by.
> >>
> >> Yishai
> >>  
> > 
> > Alex
> > 
> > Are we OK here to continue with a PR for the first patch ?
> > 
> > It seems that we should be fine here.
> > 
> > Thanks,
> > Yishai
> >   
> 
> Hi Alex,
> Any update here ?

Sure, if Leon wants to do a PR for struct
mlx5_ifc_query_page_track_obj_out_bits, that's fine.  The series looks
ok to me.  The struct definition is small enough to go through the vfio
tree with Leon's ack, but I'll leave it to you to do the right thing
relative to potential conflicts.  Thanks,

Alex
Leon Romanovsky Feb. 22, 2024, 6:33 p.m. UTC | #6
On Thu, Feb 22, 2024 at 11:04:05AM -0700, Alex Williamson wrote:
> On Wed, 21 Feb 2024 09:45:14 +0200
> Yishai Hadas <yishaih@nvidia.com> wrote:
> 
> > On 08/02/2024 10:16, Yishai Hadas wrote:
> > > On 06/02/2024 10:06, Yishai Hadas wrote:  
> > >> On 06/02/2024 9:35, Tian, Kevin wrote:  
> > >>>> From: Yishai Hadas <yishaih@nvidia.com>
> > >>>> Sent: Monday, February 5, 2024 8:48 PM
> > >>>>
> > >>>> This series improves the mlx5 driver to better handle some error cases
> > >>>> as of below.
> > >>>>
> > >>>> The first two patches let the driver recognize whether the firmware
> > >>>> moved the tracker object to an error state. In that case, the driver
> > >>>> will skip/block any usage of that object.
> > >>>>
> > >>>> The next two patches (#3, #4), improve the driver to better include the
> > >>>> proper firmware syndrome in dmesg upon a failure in some firmware
> > >>>> commands.
> > >>>>
> > >>>> The last patch follows the device specification to let the firmware 
> > >>>> know
> > >>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
> > >>>> migration cancellation, etc.).
> > >>>>
> > >>>> This will let the firmware clean its internal resources that were 
> > >>>> turned
> > >>>> on upon PRE_COPY.
> > >>>>
> > >>>> Note:
> > >>>> As the first patch should go to net/mlx5, we may need to send it as a
> > >>>> pull request format to vfio before acceptance of the series, to avoid
> > >>>> conflicts.
> > >>>>
> > >>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-
> > >>>> yishaih@nvidia.com/
> > >>>> Patch #2:
> > >>>> - Rename to use 'object changed' in some places to make it clearer.
> > >>>> - Enhance the commit log to better clarify the usage/use case.
> > >>>>
> > >>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
> > >>>>  
> > >>>
> > >>> this series looks good to me except a small remark on patch2:  
> > >>
> > >> We should be fine there, see my answer on V0.
> > >>  
> > >>>
> > >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>  
> > >>
> > >> Thanks Kevin, for your reviewed-by.
> > >>
> > >> Yishai
> > >>  
> > > 
> > > Alex
> > > 
> > > Are we OK here to continue with a PR for the first patch ?
> > > 
> > > It seems that we should be fine here.
> > > 
> > > Thanks,
> > > Yishai
> > >   
> > 
> > Hi Alex,
> > Any update here ?
> 
> Sure, if Leon wants to do a PR for struct
> mlx5_ifc_query_page_track_obj_out_bits, that's fine.  The series looks
> ok to me.  The struct definition is small enough to go through the vfio
> tree with Leon's ack, but I'll leave it to you to do the right thing
> relative to potential conflicts.  Thanks,

Alex, you are right, there is no need to send a PR for the first patch.
Please take it directly through your tree.

We don't have anything in our shared branch this cycle.

Acked-by: Leon Romanovsky <leon@kernel.org>

Thanks

> 
> Alex
>
Alex Williamson Feb. 22, 2024, 9:37 p.m. UTC | #7
On Mon, 5 Feb 2024 14:48:23 +0200
Yishai Hadas <yishaih@nvidia.com> wrote:

> This series improves the mlx5 driver to better handle some error cases
> as of below.
> 
> The first two patches let the driver recognize whether the firmware
> moved the tracker object to an error state. In that case, the driver
> will skip/block any usage of that object.
> 
> The next two patches (#3, #4), improve the driver to better include the
> proper firmware syndrome in dmesg upon a failure in some firmware
> commands.
> 
> The last patch follows the device specification to let the firmware know
> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target,
> migration cancellation, etc.).
> 
> This will let the firmware clean its internal resources that were turned
> on upon PRE_COPY.
> 
> Note:
> As the first patch should go to net/mlx5, we may need to send it as a
> pull request format to vfio before acceptance of the series, to avoid
> conflicts.
> 
> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-yishaih@nvidia.com/
> Patch #2:
> - Rename to use 'object changed' in some places to make it clearer.
> - Enhance the commit log to better clarify the usage/use case.
> 
> The above was suggested by Tian, Kevin <kevin.tian@intel.com>.
> 
> Yishai
> 
> Yishai Hadas (5):
>   net/mlx5: Add the IFC related bits for query tracker
>   vfio/mlx5: Add support for tracker object change event
>   vfio/mlx5: Handle the EREMOTEIO error upon the SAVE command
>   vfio/mlx5: Block incremental query upon migf state error
>   vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNING
> 
>  drivers/vfio/pci/mlx5/cmd.c   | 74 ++++++++++++++++++++++++++++++++---
>  drivers/vfio/pci/mlx5/cmd.h   |  5 ++-
>  drivers/vfio/pci/mlx5/main.c  | 39 ++++++++++++++----
>  include/linux/mlx5/mlx5_ifc.h |  5 +++
>  4 files changed, 110 insertions(+), 13 deletions(-)
> 

Applied to vfio next branch for v6.9.  Thanks,

Alex