Message ID | 20240205124828.232701-1-yishaih@nvidia.com (mailing list archive) |
---|---|
Headers | show |
Series | Improve mlx5 driver to better handle some error cases | expand |
> From: Yishai Hadas <yishaih@nvidia.com> > Sent: Monday, February 5, 2024 8:48 PM > > This series improves the mlx5 driver to better handle some error cases > as of below. > > The first two patches let the driver recognize whether the firmware > moved the tracker object to an error state. In that case, the driver > will skip/block any usage of that object. > > The next two patches (#3, #4), improve the driver to better include the > proper firmware syndrome in dmesg upon a failure in some firmware > commands. > > The last patch follows the device specification to let the firmware know > upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, > migration cancellation, etc.). > > This will let the firmware clean its internal resources that were turned > on upon PRE_COPY. > > Note: > As the first patch should go to net/mlx5, we may need to send it as a > pull request format to vfio before acceptance of the series, to avoid > conflicts. > > Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- > yishaih@nvidia.com/ > Patch #2: > - Rename to use 'object changed' in some places to make it clearer. > - Enhance the commit log to better clarify the usage/use case. > > The above was suggested by Tian, Kevin <kevin.tian@intel.com>. > this series looks good to me except a small remark on patch2: Reviewed-by: Kevin Tian <kevin.tian@intel.com>
On 06/02/2024 9:35, Tian, Kevin wrote: >> From: Yishai Hadas <yishaih@nvidia.com> >> Sent: Monday, February 5, 2024 8:48 PM >> >> This series improves the mlx5 driver to better handle some error cases >> as of below. >> >> The first two patches let the driver recognize whether the firmware >> moved the tracker object to an error state. In that case, the driver >> will skip/block any usage of that object. >> >> The next two patches (#3, #4), improve the driver to better include the >> proper firmware syndrome in dmesg upon a failure in some firmware >> commands. >> >> The last patch follows the device specification to let the firmware know >> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, >> migration cancellation, etc.). >> >> This will let the firmware clean its internal resources that were turned >> on upon PRE_COPY. >> >> Note: >> As the first patch should go to net/mlx5, we may need to send it as a >> pull request format to vfio before acceptance of the series, to avoid >> conflicts. >> >> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- >> yishaih@nvidia.com/ >> Patch #2: >> - Rename to use 'object changed' in some places to make it clearer. >> - Enhance the commit log to better clarify the usage/use case. >> >> The above was suggested by Tian, Kevin <kevin.tian@intel.com>. >> > > this series looks good to me except a small remark on patch2: We should be fine there, see my answer on V0. > > Reviewed-by: Kevin Tian <kevin.tian@intel.com> Thanks Kevin, for your reviewed-by. Yishai
On 06/02/2024 10:06, Yishai Hadas wrote: > On 06/02/2024 9:35, Tian, Kevin wrote: >>> From: Yishai Hadas <yishaih@nvidia.com> >>> Sent: Monday, February 5, 2024 8:48 PM >>> >>> This series improves the mlx5 driver to better handle some error cases >>> as of below. >>> >>> The first two patches let the driver recognize whether the firmware >>> moved the tracker object to an error state. In that case, the driver >>> will skip/block any usage of that object. >>> >>> The next two patches (#3, #4), improve the driver to better include the >>> proper firmware syndrome in dmesg upon a failure in some firmware >>> commands. >>> >>> The last patch follows the device specification to let the firmware know >>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, >>> migration cancellation, etc.). >>> >>> This will let the firmware clean its internal resources that were turned >>> on upon PRE_COPY. >>> >>> Note: >>> As the first patch should go to net/mlx5, we may need to send it as a >>> pull request format to vfio before acceptance of the series, to avoid >>> conflicts. >>> >>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- >>> yishaih@nvidia.com/ >>> Patch #2: >>> - Rename to use 'object changed' in some places to make it clearer. >>> - Enhance the commit log to better clarify the usage/use case. >>> >>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>. >>> >> >> this series looks good to me except a small remark on patch2: > > We should be fine there, see my answer on V0. > >> >> Reviewed-by: Kevin Tian <kevin.tian@intel.com> > > Thanks Kevin, for your reviewed-by. > > Yishai > Alex Are we OK here to continue with a PR for the first patch ? It seems that we should be fine here. Thanks, Yishai
On 08/02/2024 10:16, Yishai Hadas wrote: > On 06/02/2024 10:06, Yishai Hadas wrote: >> On 06/02/2024 9:35, Tian, Kevin wrote: >>>> From: Yishai Hadas <yishaih@nvidia.com> >>>> Sent: Monday, February 5, 2024 8:48 PM >>>> >>>> This series improves the mlx5 driver to better handle some error cases >>>> as of below. >>>> >>>> The first two patches let the driver recognize whether the firmware >>>> moved the tracker object to an error state. In that case, the driver >>>> will skip/block any usage of that object. >>>> >>>> The next two patches (#3, #4), improve the driver to better include the >>>> proper firmware syndrome in dmesg upon a failure in some firmware >>>> commands. >>>> >>>> The last patch follows the device specification to let the firmware >>>> know >>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, >>>> migration cancellation, etc.). >>>> >>>> This will let the firmware clean its internal resources that were >>>> turned >>>> on upon PRE_COPY. >>>> >>>> Note: >>>> As the first patch should go to net/mlx5, we may need to send it as a >>>> pull request format to vfio before acceptance of the series, to avoid >>>> conflicts. >>>> >>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- >>>> yishaih@nvidia.com/ >>>> Patch #2: >>>> - Rename to use 'object changed' in some places to make it clearer. >>>> - Enhance the commit log to better clarify the usage/use case. >>>> >>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>. >>>> >>> >>> this series looks good to me except a small remark on patch2: >> >> We should be fine there, see my answer on V0. >> >>> >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com> >> >> Thanks Kevin, for your reviewed-by. >> >> Yishai >> > > Alex > > Are we OK here to continue with a PR for the first patch ? > > It seems that we should be fine here. > > Thanks, > Yishai > Hi Alex, Any update here ? Thanks, Yishai
On Wed, 21 Feb 2024 09:45:14 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > On 08/02/2024 10:16, Yishai Hadas wrote: > > On 06/02/2024 10:06, Yishai Hadas wrote: > >> On 06/02/2024 9:35, Tian, Kevin wrote: > >>>> From: Yishai Hadas <yishaih@nvidia.com> > >>>> Sent: Monday, February 5, 2024 8:48 PM > >>>> > >>>> This series improves the mlx5 driver to better handle some error cases > >>>> as of below. > >>>> > >>>> The first two patches let the driver recognize whether the firmware > >>>> moved the tracker object to an error state. In that case, the driver > >>>> will skip/block any usage of that object. > >>>> > >>>> The next two patches (#3, #4), improve the driver to better include the > >>>> proper firmware syndrome in dmesg upon a failure in some firmware > >>>> commands. > >>>> > >>>> The last patch follows the device specification to let the firmware > >>>> know > >>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, > >>>> migration cancellation, etc.). > >>>> > >>>> This will let the firmware clean its internal resources that were > >>>> turned > >>>> on upon PRE_COPY. > >>>> > >>>> Note: > >>>> As the first patch should go to net/mlx5, we may need to send it as a > >>>> pull request format to vfio before acceptance of the series, to avoid > >>>> conflicts. > >>>> > >>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- > >>>> yishaih@nvidia.com/ > >>>> Patch #2: > >>>> - Rename to use 'object changed' in some places to make it clearer. > >>>> - Enhance the commit log to better clarify the usage/use case. > >>>> > >>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>. > >>>> > >>> > >>> this series looks good to me except a small remark on patch2: > >> > >> We should be fine there, see my answer on V0. > >> > >>> > >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com> > >> > >> Thanks Kevin, for your reviewed-by. > >> > >> Yishai > >> > > > > Alex > > > > Are we OK here to continue with a PR for the first patch ? > > > > It seems that we should be fine here. > > > > Thanks, > > Yishai > > > > Hi Alex, > Any update here ? Sure, if Leon wants to do a PR for struct mlx5_ifc_query_page_track_obj_out_bits, that's fine. The series looks ok to me. The struct definition is small enough to go through the vfio tree with Leon's ack, but I'll leave it to you to do the right thing relative to potential conflicts. Thanks, Alex
On Thu, Feb 22, 2024 at 11:04:05AM -0700, Alex Williamson wrote: > On Wed, 21 Feb 2024 09:45:14 +0200 > Yishai Hadas <yishaih@nvidia.com> wrote: > > > On 08/02/2024 10:16, Yishai Hadas wrote: > > > On 06/02/2024 10:06, Yishai Hadas wrote: > > >> On 06/02/2024 9:35, Tian, Kevin wrote: > > >>>> From: Yishai Hadas <yishaih@nvidia.com> > > >>>> Sent: Monday, February 5, 2024 8:48 PM > > >>>> > > >>>> This series improves the mlx5 driver to better handle some error cases > > >>>> as of below. > > >>>> > > >>>> The first two patches let the driver recognize whether the firmware > > >>>> moved the tracker object to an error state. In that case, the driver > > >>>> will skip/block any usage of that object. > > >>>> > > >>>> The next two patches (#3, #4), improve the driver to better include the > > >>>> proper firmware syndrome in dmesg upon a failure in some firmware > > >>>> commands. > > >>>> > > >>>> The last patch follows the device specification to let the firmware > > >>>> know > > >>>> upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, > > >>>> migration cancellation, etc.). > > >>>> > > >>>> This will let the firmware clean its internal resources that were > > >>>> turned > > >>>> on upon PRE_COPY. > > >>>> > > >>>> Note: > > >>>> As the first patch should go to net/mlx5, we may need to send it as a > > >>>> pull request format to vfio before acceptance of the series, to avoid > > >>>> conflicts. > > >>>> > > >>>> Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1- > > >>>> yishaih@nvidia.com/ > > >>>> Patch #2: > > >>>> - Rename to use 'object changed' in some places to make it clearer. > > >>>> - Enhance the commit log to better clarify the usage/use case. > > >>>> > > >>>> The above was suggested by Tian, Kevin <kevin.tian@intel.com>. > > >>>> > > >>> > > >>> this series looks good to me except a small remark on patch2: > > >> > > >> We should be fine there, see my answer on V0. > > >> > > >>> > > >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com> > > >> > > >> Thanks Kevin, for your reviewed-by. > > >> > > >> Yishai > > >> > > > > > > Alex > > > > > > Are we OK here to continue with a PR for the first patch ? > > > > > > It seems that we should be fine here. > > > > > > Thanks, > > > Yishai > > > > > > > Hi Alex, > > Any update here ? > > Sure, if Leon wants to do a PR for struct > mlx5_ifc_query_page_track_obj_out_bits, that's fine. The series looks > ok to me. The struct definition is small enough to go through the vfio > tree with Leon's ack, but I'll leave it to you to do the right thing > relative to potential conflicts. Thanks, Alex, you are right, there is no need to send a PR for the first patch. Please take it directly through your tree. We don't have anything in our shared branch this cycle. Acked-by: Leon Romanovsky <leon@kernel.org> Thanks > > Alex >
On Mon, 5 Feb 2024 14:48:23 +0200 Yishai Hadas <yishaih@nvidia.com> wrote: > This series improves the mlx5 driver to better handle some error cases > as of below. > > The first two patches let the driver recognize whether the firmware > moved the tracker object to an error state. In that case, the driver > will skip/block any usage of that object. > > The next two patches (#3, #4), improve the driver to better include the > proper firmware syndrome in dmesg upon a failure in some firmware > commands. > > The last patch follows the device specification to let the firmware know > upon leaving PRE_COPY back to RUNNING. (e.g. error in the target, > migration cancellation, etc.). > > This will let the firmware clean its internal resources that were turned > on upon PRE_COPY. > > Note: > As the first patch should go to net/mlx5, we may need to send it as a > pull request format to vfio before acceptance of the series, to avoid > conflicts. > > Changes from V0: https://lore.kernel.org/kvm/20240130170227.153464-1-yishaih@nvidia.com/ > Patch #2: > - Rename to use 'object changed' in some places to make it clearer. > - Enhance the commit log to better clarify the usage/use case. > > The above was suggested by Tian, Kevin <kevin.tian@intel.com>. > > Yishai > > Yishai Hadas (5): > net/mlx5: Add the IFC related bits for query tracker > vfio/mlx5: Add support for tracker object change event > vfio/mlx5: Handle the EREMOTEIO error upon the SAVE command > vfio/mlx5: Block incremental query upon migf state error > vfio/mlx5: Let firmware knows upon leaving PRE_COPY back to RUNNING > > drivers/vfio/pci/mlx5/cmd.c | 74 ++++++++++++++++++++++++++++++++--- > drivers/vfio/pci/mlx5/cmd.h | 5 ++- > drivers/vfio/pci/mlx5/main.c | 39 ++++++++++++++---- > include/linux/mlx5/mlx5_ifc.h | 5 +++ > 4 files changed, 110 insertions(+), 13 deletions(-) > Applied to vfio next branch for v6.9. Thanks, Alex