Message ID: 20221114125755.13659-1-michal.swiatkowski@linux.intel.com
Series: resource management using devlink reload
On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > Currently the default value for number of PF vectors is number of CPUs. > Because of that there are cases when all vectors are used for PF > and user can't create more VFs. It is hard to set default number of > CPUs right for all different use cases. Instead allow user to choose > how many vectors should be used for various features. After implementing > subdevices this mechanism will be also used to set number of vectors > for subfunctions. > > The idea is to set vectors for eth or VFs using devlink resource API. > New value of vectors will be used after devlink reinit. Example > commands: > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > $ sudo devlink dev reload pci/0000:31:00.0 > After reload driver will work with 16 vectors used for eth instead of > num_cpus. By saying "vectors", are you referring to MSI-X vectors? If yes, you have specific interface for that. https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ Thanks
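For readers following along: the devlink resource API referenced in the cover letter lets a driver expose a named, user-resizable quantity and read the configured size back on reload. A rough sketch of how a driver might register an msix/msix_eth resource is below; the IDs, limits and defaults are illustrative assumptions, not code from this series.

#include <net/devlink.h>

/* Illustrative IDs only; the series defines its own numbering. */
enum {
	EXAMPLE_RES_MSIX = 1,		/* parent node: "msix"   */
	EXAMPLE_RES_MSIX_ETH,		/* leaf: "msix/msix_eth" */
};

static int example_register_msix_resources(struct devlink *devlink,
					   u64 msix_total, u64 eth_default)
{
	struct devlink_resource_size_params params;
	int err;

	devlink_resource_size_params_init(&params, msix_total, msix_total, 1,
					  DEVLINK_RESOURCE_UNIT_ENTRY);
	err = devlink_resource_register(devlink, "msix", msix_total,
					EXAMPLE_RES_MSIX,
					DEVLINK_RESOURCE_ID_PARENT_TOP,
					&params);
	if (err)
		return err;

	/* The user may later resize this leaf with
	 * "devlink resource set ... path msix/msix_eth size N". */
	devlink_resource_size_params_init(&params, 1, msix_total, 1,
					  DEVLINK_RESOURCE_UNIT_ENTRY);
	return devlink_resource_register(devlink, "msix_eth", eth_default,
					 EXAMPLE_RES_MSIX_ETH,
					 EXAMPLE_RES_MSIX, &params);
}

/* On devlink reload the driver asks what the user configured. */
static u64 example_msix_eth_size(struct devlink *devlink, u64 fallback)
{
	u64 size;

	if (devlink_resource_size_get(devlink, EXAMPLE_RES_MSIX_ETH, &size))
		return fallback;
	return size;
}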
On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: >> Currently the default value for number of PF vectors is number of CPUs. >> Because of that there are cases when all vectors are used for PF >> and user can't create more VFs. It is hard to set default number of >> CPUs right for all different use cases. Instead allow user to choose >> how many vectors should be used for various features. After implementing >> subdevices this mechanism will be also used to set number of vectors >> for subfunctions. >> >> The idea is to set vectors for eth or VFs using devlink resource API. >> New value of vectors will be used after devlink reinit. Example >> commands: >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 >> $ sudo devlink dev reload pci/0000:31:00.0 >> After reload driver will work with 16 vectors used for eth instead of >> num_cpus. > By saying "vectors", are you referring to MSI-X vectors? > If yes, you have specific interface for that. > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ This patch series is exposing a resources API to split the device level MSI-X vectors across the different functions supported by the device (PF, RDMA, SR-IOV VFs and in future subfunctions). Today this is all hidden in a policy implemented within the PF driver. The patch you are referring to seems to be providing an interface to change the msix count for a particular VF. This patch is providing a interface to set the total msix count for all the possible VFs from the available device level pool of msix-vectors.
> -----Original Message----- > From: Samudrala, Sridhar <sridhar.samudrala@intel.com> > Sent: Monday, November 14, 2022 7:31 AM > To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski > <michal.swiatkowski@linux.intel.com> > Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org; > pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org; > jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin, > Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech > <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>; > Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse > <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>; > Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr > <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman, > David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek > <leszek.kaliszczuk@intel.com> > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > >> Currently the default value for number of PF vectors is number of CPUs. > >> Because of that there are cases when all vectors are used for PF > >> and user can't create more VFs. It is hard to set default number of > >> CPUs right for all different use cases. Instead allow user to choose > >> how many vectors should be used for various features. After implementing > >> subdevices this mechanism will be also used to set number of vectors > >> for subfunctions. > >> > >> The idea is to set vectors for eth or VFs using devlink resource API. > >> New value of vectors will be used after devlink reinit. Example > >> commands: > >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > >> $ sudo devlink dev reload pci/0000:31:00.0 > >> After reload driver will work with 16 vectors used for eth instead of > >> num_cpus. > > By saying "vectors", are you referring to MSI-X vectors? > > If yes, you have specific interface for that. > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > This patch series is exposing a resources API to split the device level MSI-X vectors > across the different functions supported by the device (PF, RDMA, SR-IOV VFs > and > in future subfunctions). Today this is all hidden in a policy implemented within > the PF driver. > > The patch you are referring to seems to be providing an interface to change the > msix count for a particular VF. This patch is providing a interface to set the total > msix count for all the possible VFs from the available device level pool of > msix-vectors. > It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF Interface to allow configuring individual VFs. Thanks, Jake
On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > Currently the default value for number of PF vectors is number of CPUs. > > > Because of that there are cases when all vectors are used for PF > > > and user can't create more VFs. It is hard to set default number of > > > CPUs right for all different use cases. Instead allow user to choose > > > how many vectors should be used for various features. After implementing > > > subdevices this mechanism will be also used to set number of vectors > > > for subfunctions. > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > New value of vectors will be used after devlink reinit. Example > > > commands: > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > After reload driver will work with 16 vectors used for eth instead of > > > num_cpus. > > By saying "vectors", are you referring to MSI-X vectors? > > If yes, you have specific interface for that. > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > This patch series is exposing a resources API to split the device level MSI-X vectors > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > in future subfunctions). Today this is all hidden in a policy implemented within > the PF driver. Maybe we are talking about different VFs, but if you refer to PCI VFs, the amount of MSI-X comes from PCI config space for that specific VF. You shouldn't set any value through netdev as it will cause to difference in output between lspci (which doesn't require any driver) and your newly set number. Also in RDMA case, it is not clear what will you achieve by this setting too. Thanks
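For context on the lspci point above: the per-function MSI-X count lives in the MSI-X capability's Message Control register in config space, which is why it is visible without any driver bound. A rough in-kernel equivalent of what pci_msix_vec_count() returns:

#include <linux/pci.h>

/* The count lspci prints comes straight from the MSI-X capability in
 * config space.  In-kernel the same value is returned by
 * pci_msix_vec_count(), roughly: */
static int example_msix_table_size(struct pci_dev *pdev)
{
	u16 control;

	if (!pdev->msix_cap)
		return -EINVAL;

	pci_read_config_word(pdev, pdev->msix_cap + PCI_MSIX_FLAGS, &control);

	/* Table Size field is encoded as N - 1. */
	return (control & PCI_MSIX_FLAGS_QSIZE) + 1;
}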
On Mon, Nov 14, 2022 at 04:58:57PM +0000, Keller, Jacob E wrote: > > > > -----Original Message----- > > From: Samudrala, Sridhar <sridhar.samudrala@intel.com> > > Sent: Monday, November 14, 2022 7:31 AM > > To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski > > <michal.swiatkowski@linux.intel.com> > > Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org; > > pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org; > > jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin, > > Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech > > <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>; > > Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse > > <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>; > > Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr > > <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman, > > David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek > > <leszek.kaliszczuk@intel.com> > > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > >> Currently the default value for number of PF vectors is number of CPUs. > > >> Because of that there are cases when all vectors are used for PF > > >> and user can't create more VFs. It is hard to set default number of > > >> CPUs right for all different use cases. Instead allow user to choose > > >> how many vectors should be used for various features. After implementing > > >> subdevices this mechanism will be also used to set number of vectors > > >> for subfunctions. > > >> > > >> The idea is to set vectors for eth or VFs using devlink resource API. > > >> New value of vectors will be used after devlink reinit. Example > > >> commands: > > >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > >> $ sudo devlink dev reload pci/0000:31:00.0 > > >> After reload driver will work with 16 vectors used for eth instead of > > >> num_cpus. > > > By saying "vectors", are you referring to MSI-X vectors? > > > If yes, you have specific interface for that. > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs > > and > > in future subfunctions). Today this is all hidden in a policy implemented within > > the PF driver. > > > > The patch you are referring to seems to be providing an interface to change the > > msix count for a particular VF. This patch is providing a interface to set the total > > msix count for all the possible VFs from the available device level pool of > > msix-vectors. > > > > It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF Interface to allow configuring individual VFs. Yes, to be aligned with PCI spec and see coherent lspci output for VFs. Thanks > > Thanks, > Jake
On Mon, Nov 14, 2022 at 07:09:36PM +0200, Leon Romanovsky wrote: > On Mon, Nov 14, 2022 at 04:58:57PM +0000, Keller, Jacob E wrote: > > > > > > > -----Original Message----- > > > From: Samudrala, Sridhar <sridhar.samudrala@intel.com> > > > Sent: Monday, November 14, 2022 7:31 AM > > > To: Leon Romanovsky <leon@kernel.org>; Michal Swiatkowski > > > <michal.swiatkowski@linux.intel.com> > > > Cc: netdev@vger.kernel.org; davem@davemloft.net; kuba@kernel.org; > > > pabeni@redhat.com; edumazet@google.com; intel-wired-lan@lists.osuosl.org; > > > jiri@nvidia.com; Nguyen, Anthony L <anthony.l.nguyen@intel.com>; Lobakin, > > > Alexandr <alexandr.lobakin@intel.com>; Drewek, Wojciech > > > <wojciech.drewek@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>; > > > Saleem, Shiraz <shiraz.saleem@intel.com>; Brandeburg, Jesse > > > <jesse.brandeburg@intel.com>; Ismail, Mustafa <mustafa.ismail@intel.com>; > > > Kitszel, Przemyslaw <przemyslaw.kitszel@intel.com>; Raczynski, Piotr > > > <piotr.raczynski@intel.com>; Keller, Jacob E <jacob.e.keller@intel.com>; Ertman, > > > David M <david.m.ertman@intel.com>; Kaliszczuk, Leszek > > > <leszek.kaliszczuk@intel.com> > > > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > >> Currently the default value for number of PF vectors is number of CPUs. > > > >> Because of that there are cases when all vectors are used for PF > > > >> and user can't create more VFs. It is hard to set default number of > > > >> CPUs right for all different use cases. Instead allow user to choose > > > >> how many vectors should be used for various features. After implementing > > > >> subdevices this mechanism will be also used to set number of vectors > > > >> for subfunctions. > > > >> > > > >> The idea is to set vectors for eth or VFs using devlink resource API. > > > >> New value of vectors will be used after devlink reinit. Example > > > >> commands: > > > >> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > >> $ sudo devlink dev reload pci/0000:31:00.0 > > > >> After reload driver will work with 16 vectors used for eth instead of > > > >> num_cpus. > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > If yes, you have specific interface for that. > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs > > > and > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > the PF driver. > > > > > > The patch you are referring to seems to be providing an interface to change the > > > msix count for a particular VF. This patch is providing a interface to set the total > > > msix count for all the possible VFs from the available device level pool of > > > msix-vectors. > > > > > > > It looks like we should implement both: resources to configure the "pool" of available vectors for each VF, and the sysfs VF Interface to allow configuring individual VFs. > > Yes, to be aligned with PCI spec and see coherent lspci output for VFs. > > Thanks > Thanks, I will implement this sysf interface to configure individual VFs. > > > > Thanks, > > Jake
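For reference, the per-VF interface from the linked series (in what appears to be its merged form) is a pair of optional struct pci_driver callbacks backing sysfs attributes on the PF and the VF. A sketch of the general shape, with the driver-specific bookkeeping reduced to placeholders:

#include <linux/pci.h>

/* Writes to /sys/bus/pci/devices/<VF>/sriov_vf_msix_count land here while
 * the VF has no driver bound; the PF driver validates the request against
 * its device-level pool and programs the VF's MSI-X table size. */
static int example_sriov_set_msix_vec_count(struct pci_dev *vf, int count)
{
	if (count < 0)
		return -EINVAL;
	/* Check against the remaining pool and push to FW/HW (omitted). */
	return 0;
}

/* Backs the read-only PF attribute sriov_vf_total_msix: the size of the
 * pool that may be split among all VFs.  Fixed here for illustration. */
static u32 example_sriov_get_vf_total_msix(struct pci_dev *pf)
{
	return 512;
}

static struct pci_driver example_pf_driver = {
	.name			  = "example_pf",
	/* .id_table, .probe, .remove, .sriov_configure, ... */
	.sriov_set_msix_vec_count = example_sriov_set_msix_vec_count,
	.sriov_get_vf_total_msix  = example_sriov_get_vf_total_msix,
};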
On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > Because of that there are cases when all vectors are used for PF > > > > and user can't create more VFs. It is hard to set default number of > > > > CPUs right for all different use cases. Instead allow user to choose > > > > how many vectors should be used for various features. After implementing > > > > subdevices this mechanism will be also used to set number of vectors > > > > for subfunctions. > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > New value of vectors will be used after devlink reinit. Example > > > > commands: > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > After reload driver will work with 16 vectors used for eth instead of > > > > num_cpus. > > > By saying "vectors", are you referring to MSI-X vectors? > > > If yes, you have specific interface for that. > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > in future subfunctions). Today this is all hidden in a policy implemented within > > the PF driver. > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > the amount of MSI-X comes from PCI config space for that specific VF. > > You shouldn't set any value through netdev as it will cause to > difference in output between lspci (which doesn't require any driver) > and your newly set number. If I understand correctly, lspci shows the MSI-X number for individual VF. Value set via devlink is the total number of MSI-X that can be used when creating VFs. As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > Also in RDMA case, it is not clear what will you achieve by this > setting too. > We have limited number of MSI-X (1024) in the device. Because of that the amount of MSI-X for each feature is set to the best values. Half for ethernet, half for RDMA. This patchset allow user to change this values. If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. > Thanks Thanks for reviewing
On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > Because of that there are cases when all vectors are used for PF > > > > > and user can't create more VFs. It is hard to set default number of > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > how many vectors should be used for various features. After implementing > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > for subfunctions. > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > New value of vectors will be used after devlink reinit. Example > > > > > commands: > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > num_cpus. > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > If yes, you have specific interface for that. > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > the PF driver. > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > You shouldn't set any value through netdev as it will cause to > > difference in output between lspci (which doesn't require any driver) > > and your newly set number. > > If I understand correctly, lspci shows the MSI-X number for individual > VF. Value set via devlink is the total number of MSI-X that can be used > when creating VFs. Yes and no, lspci shows how much MSI-X vectors exist from HW point of view. Driver can use less than that. It is exactly as your proposed devlink interface. > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > Also in RDMA case, it is not clear what will you achieve by this > > setting too. > > > > We have limited number of MSI-X (1024) in the device. Because of that > the amount of MSI-X for each feature is set to the best values. Half for > ethernet, half for RDMA. This patchset allow user to change this values. > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. RDMA devices doesn't have PCI logic and everything is controlled through you main core module. It means that when you create RDMA auxiliary device, it will be connected to netdev (RoCE and iWARP) and that netdev should deal with vectors. So I still don't understand what does it mean "half for RDMA". Thanks
On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > how many vectors should be used for various features. After implementing > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > for subfunctions. > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > commands: > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > num_cpus. > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > If yes, you have specific interface for that. > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > the PF driver. > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > You shouldn't set any value through netdev as it will cause to > > > difference in output between lspci (which doesn't require any driver) > > > and your newly set number. > > > > If I understand correctly, lspci shows the MSI-X number for individual > > VF. Value set via devlink is the total number of MSI-X that can be used > > when creating VFs. > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > view. Driver can use less than that. It is exactly as your proposed > devlink interface. > > Ok, I have to take a closer look at it. So, are You saing that we should drop this devlink solution and use sysfs interface fo VFs or are You fine with having both? What with MSI-X allocation for subfunction? > > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > > > > Also in RDMA case, it is not clear what will you achieve by this > > > setting too. > > > > > > > We have limited number of MSI-X (1024) in the device. Because of that > > the amount of MSI-X for each feature is set to the best values. Half for > > ethernet, half for RDMA. This patchset allow user to change this values. > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. > > RDMA devices doesn't have PCI logic and everything is controlled through > you main core module. It means that when you create RDMA auxiliary device, > it will be connected to netdev (RoCE and iWARP) and that netdev should > deal with vectors. 
So I still don't understand what does it mean "half > for RDMA". > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA are reserved and can't be used by ethernet. For example I have 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't be able to use it in ethernet and vice versa. By saing it can't be used I mean that irdma driver received the MSI-X vectors number and it is using them (connected them with RDMA interrupts). Devlink resource is a way to change the number of MSI-X vectors that will be reserved for RDMA. You wrote that netdev should deal with vectors, but how netdev will know how many vectors should go to RDMA aux device? Does there an interface for setting the vectors amount for RDMA device? Thanks > Thanks
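To illustrate the handover being asked about: in the auxiliary-bus model the PF driver decides how many of its MSI-X vectors the RDMA driver may use and passes that through the auxiliary device it creates. The structure and field names below are invented for illustration; the real ice/irdma interface differs.

#include <linux/auxiliary_bus.h>
#include <linux/slab.h>

/* Hypothetical carrier the PF driver hands to the RDMA auxiliary driver. */
struct example_rdma_aux_dev {
	struct auxiliary_device adev;
	int msix_base;		/* first MSI-X index reserved for RDMA */
	int msix_count;		/* the "half for RDMA" from the thread  */
};

static void example_adev_release(struct device *dev)
{
	struct auxiliary_device *adev = to_auxiliary_dev(dev);

	kfree(container_of(adev, struct example_rdma_aux_dev, adev));
}

/* PF driver side: publish how many vectors the RDMA driver may use.  The
 * RDMA driver's probe() reads msix_base/msix_count and requests those
 * interrupts, so the ethernet side must not touch that range. */
static int example_plug_rdma(struct device *parent, int base, int count)
{
	struct example_rdma_aux_dev *edev;
	int err;

	edev = kzalloc(sizeof(*edev), GFP_KERNEL);
	if (!edev)
		return -ENOMEM;

	edev->msix_base = base;
	edev->msix_count = count;
	edev->adev.name = "rdma";
	edev->adev.dev.parent = parent;
	edev->adev.dev.release = example_adev_release;

	err = auxiliary_device_init(&edev->adev);
	if (err) {
		kfree(edev);
		return err;
	}

	err = auxiliary_device_add(&edev->adev);
	if (err)
		auxiliary_device_uninit(&edev->adev);	/* calls release() */
	return err;
}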
On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > > how many vectors should be used for various features. After implementing > > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > > for subfunctions. > > > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > > commands: > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > > num_cpus. > > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > > If yes, you have specific interface for that. > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > > the PF driver. > > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > > > You shouldn't set any value through netdev as it will cause to > > > > difference in output between lspci (which doesn't require any driver) > > > > and your newly set number. > > > > > > If I understand correctly, lspci shows the MSI-X number for individual > > > VF. Value set via devlink is the total number of MSI-X that can be used > > > when creating VFs. > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > > view. Driver can use less than that. It is exactly as your proposed > > devlink interface. > > > > > > Ok, I have to take a closer look at it. So, are You saing that we should > drop this devlink solution and use sysfs interface fo VFs or are You > fine with having both? What with MSI-X allocation for subfunction? You should drop for VFs and PFs and keep it for SFs only. > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > > > > > > > Also in RDMA case, it is not clear what will you achieve by this > > > > setting too. > > > > > > > > > > We have limited number of MSI-X (1024) in the device. Because of that > > > the amount of MSI-X for each feature is set to the best values. Half for > > > ethernet, half for RDMA. This patchset allow user to change this values. > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. 
> > > > RDMA devices doesn't have PCI logic and everything is controlled through > > you main core module. It means that when you create RDMA auxiliary device, > > it will be connected to netdev (RoCE and iWARP) and that netdev should > > deal with vectors. So I still don't understand what does it mean "half > > for RDMA". > > > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA > are reserved and can't be used by ethernet. For example I have > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't > be able to use it in ethernet and vice versa. > > By saing it can't be used I mean that irdma driver received the MSI-X > vectors number and it is using them (connected them with RDMA interrupts). > > Devlink resource is a way to change the number of MSI-X vectors that > will be reserved for RDMA. You wrote that netdev should deal with > vectors, but how netdev will know how many vectors should go to RDMA aux > device? Does there an interface for setting the vectors amount for RDMA > device? When RDMA core adds device, it calls to irdma_init_rdma_device() and num_comp_vectors is actually the number of MSI-X vectors which you want to give to that device. I'm trying to say that probably you shouldn't reserve specific vectors for both ethernet and RDMA and simply share same vectors. RDMA applications that care about performance set comp_vector through user space verbs anyway. Thanks > > Thanks > > > Thanks
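In code terms, the num_comp_vectors point above might look roughly like this in the auxiliary driver's init path (a sketch, not irdma's actual code):

#include <rdma/ib_verbs.h>

/* Sketch: whatever vector count the PF driver granted becomes the
 * completion-vector count that user space later sees as
 * context->num_comp_vectors. */
static void example_init_rdma_device(struct ib_device *ibdev, int rdma_msix)
{
	ibdev->num_comp_vectors = rdma_msix;
	/* ... set ops, ports, etc., then ib_register_device() ... */
}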
On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote: > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > > > how many vectors should be used for various features. After implementing > > > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > > > for subfunctions. > > > > > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > > > commands: > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > > > num_cpus. > > > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > > > If yes, you have specific interface for that. > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > > > the PF driver. > > > > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > > > > > You shouldn't set any value through netdev as it will cause to > > > > > difference in output between lspci (which doesn't require any driver) > > > > > and your newly set number. > > > > > > > > If I understand correctly, lspci shows the MSI-X number for individual > > > > VF. Value set via devlink is the total number of MSI-X that can be used > > > > when creating VFs. > > > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > > > view. Driver can use less than that. It is exactly as your proposed > > > devlink interface. > > > > > > > > > > Ok, I have to take a closer look at it. So, are You saing that we should > > drop this devlink solution and use sysfs interface fo VFs or are You > > fine with having both? What with MSI-X allocation for subfunction? > > You should drop for VFs and PFs and keep it for SFs only. > I understand that MSI-X for VFs can be set via sysfs interface, but what with PFs? Should we always allow max MSI-X for PFs? So hw_max - used - sfs? Is it save to call pci_enable_msix always with max vectors supported by device? I added the value for PFs, because we faced a problem with MSI-X allocation on 8 port device. 
Our default value (num_cpus) was too big (not enough vectors in hw). Changing the amount of vectors that can be used on PFs was solving the issue. Let me write an example. As default MSI-X for PF is set to num_cpus, the platform have 128 CPUs, we have 8 port device installed there and still have 1024 vectors in HW (I simplified because I don't count additional interrupts). We run out of vectors, there is 0 vectors that can be used for VFs. Sure, it is easy to handle, we can divide PFs interrupts by 2 and will end with 512 vectors for VFs. I assume that with current sysfs interface in this situation MSI-X for VFs can be set from 0 to 512? What if user wants more? If there is a PFs MSI-X value which can be set by user, user can decrease the value and use more vectors for VFs. Is it possible in current VFs sysfs interface? I mean, setting VFs MSI-X vectors to value that will need to decrease MSI-X for PFs. > > > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > > > > > > > > > > Also in RDMA case, it is not clear what will you achieve by this > > > > > setting too. > > > > > > > > > > > > > We have limited number of MSI-X (1024) in the device. Because of that > > > > the amount of MSI-X for each feature is set to the best values. Half for > > > > ethernet, half for RDMA. This patchset allow user to change this values. > > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. > > > > > > RDMA devices doesn't have PCI logic and everything is controlled through > > > you main core module. It means that when you create RDMA auxiliary device, > > > it will be connected to netdev (RoCE and iWARP) and that netdev should > > > deal with vectors. So I still don't understand what does it mean "half > > > for RDMA". > > > > > > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA > > are reserved and can't be used by ethernet. For example I have > > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for > > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't > > be able to use it in ethernet and vice versa. > > > > By saing it can't be used I mean that irdma driver received the MSI-X > > vectors number and it is using them (connected them with RDMA interrupts). > > > > Devlink resource is a way to change the number of MSI-X vectors that > > will be reserved for RDMA. You wrote that netdev should deal with > > vectors, but how netdev will know how many vectors should go to RDMA aux > > device? Does there an interface for setting the vectors amount for RDMA > > device? > > When RDMA core adds device, it calls to irdma_init_rdma_device() and > num_comp_vectors is actually the number of MSI-X vectors which you want > to give to that device. > > I'm trying to say that probably you shouldn't reserve specific vectors > for both ethernet and RDMA and simply share same vectors. RDMA applications > that care about performance set comp_vector through user space verbs anyway. > Thanks for explanation, appriciate that. In our driver num_comp_vectors for RDMA is set during driver probe. Do we have any interface to change num_comp_vectors while driver is working? Sorry, I am not fully familiar with RDMA. Can user app for RDMA set comp_vector to any value or only to max which is num_comp_vectors given for RDMA while creating aux device? Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to 64, how I will know if I can or can't use these vectors in ethernet? 
Thanks > Thanks > > > > > Thanks > > > > > Thanks
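To make the vector budgeting in the 8-port example above concrete, here is a hypothetical helper that caps the per-PF default so the device-wide pool is not exhausted (names and numbers are illustrative only):

#include <linux/kernel.h>
#include <linux/cpumask.h>

/* The 8-port example in code form: hw_total = 1024, num_pfs = 8,
 * num_online_cpus() = 128.  The naive default of num_online_cpus() per PF
 * consumes 8 * 128 = 1024 vectors and leaves none for VFs; reserving a VF
 * share up front avoids that. */
static u16 example_default_pf_msix(u16 hw_total, u8 num_pfs,
				   u16 reserve_for_vfs)
{
	u16 per_pf = (hw_total - reserve_for_vfs) / num_pfs;

	return min_t(u16, per_pf, num_online_cpus());
}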
On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote: > On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote: > > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: > > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > > > > how many vectors should be used for various features. After implementing > > > > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > > > > for subfunctions. > > > > > > > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > > > > commands: > > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > > > > num_cpus. > > > > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > > > > If yes, you have specific interface for that. > > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > > > > the PF driver. > > > > > > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > > > > > > > You shouldn't set any value through netdev as it will cause to > > > > > > difference in output between lspci (which doesn't require any driver) > > > > > > and your newly set number. > > > > > > > > > > If I understand correctly, lspci shows the MSI-X number for individual > > > > > VF. Value set via devlink is the total number of MSI-X that can be used > > > > > when creating VFs. > > > > > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > > > > view. Driver can use less than that. It is exactly as your proposed > > > > devlink interface. > > > > > > > > > > > > > > Ok, I have to take a closer look at it. So, are You saing that we should > > > drop this devlink solution and use sysfs interface fo VFs or are You > > > fine with having both? What with MSI-X allocation for subfunction? > > > > You should drop for VFs and PFs and keep it for SFs only. > > > > I understand that MSI-X for VFs can be set via sysfs interface, but what > with PFs? PFs are even more tricker than VFs, as you are changing that number while driver is bound. 
This makes me wonder what will be lspci output, as you will need to show right number before driver starts to load. You need to present right value if user decided to unbind driver from PF too. > Should we always allow max MSI-X for PFs? So hw_max - used - > sfs? Is it save to call pci_enable_msix always with max vectors > supported by device? I'm not sure. I think that it won't give you much if you enable more than num_online_cpu(). > > I added the value for PFs, because we faced a problem with MSI-X > allocation on 8 port device. Our default value (num_cpus) was too big > (not enough vectors in hw). Changing the amount of vectors that can be > used on PFs was solving the issue. We had something similar for mlx5 SFs, where we don't have enough vectors. Our solution is simply to move to automatic shared MSI-X mode. I would advocate for that for you as well. > > Let me write an example. As default MSI-X for PF is set to num_cpus, the > platform have 128 CPUs, we have 8 port device installed there and still > have 1024 vectors in HW (I simplified because I don't count additional > interrupts). We run out of vectors, there is 0 vectors that can be used > for VFs. Sure, it is easy to handle, we can divide PFs interrupts by 2 > and will end with 512 vectors for VFs. I assume that with current sysfs > interface in this situation MSI-X for VFs can be set from 0 to 512? What > if user wants more? If there is a PFs MSI-X value which can be set by > user, user can decrease the value and use more vectors for VFs. Is it > possible in current VFs sysfs interface? I mean, setting VFs MSI-X > vectors to value that will need to decrease MSI-X for PFs. You can't do it and this limitation is because PF is bound. You can't change that number while driver is running. AFAIR, such change will be PCI spec violation. > > > > > > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > > > > > > > > > > > > > Also in RDMA case, it is not clear what will you achieve by this > > > > > > setting too. > > > > > > > > > > > > > > > > We have limited number of MSI-X (1024) in the device. Because of that > > > > > the amount of MSI-X for each feature is set to the best values. Half for > > > > > ethernet, half for RDMA. This patchset allow user to change this values. > > > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. > > > > > > > > RDMA devices doesn't have PCI logic and everything is controlled through > > > > you main core module. It means that when you create RDMA auxiliary device, > > > > it will be connected to netdev (RoCE and iWARP) and that netdev should > > > > deal with vectors. So I still don't understand what does it mean "half > > > > for RDMA". > > > > > > > > > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA > > > are reserved and can't be used by ethernet. For example I have > > > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for > > > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't > > > be able to use it in ethernet and vice versa. > > > > > > By saing it can't be used I mean that irdma driver received the MSI-X > > > vectors number and it is using them (connected them with RDMA interrupts). > > > > > > Devlink resource is a way to change the number of MSI-X vectors that > > > will be reserved for RDMA. You wrote that netdev should deal with > > > vectors, but how netdev will know how many vectors should go to RDMA aux > > > device? 
Does there an interface for setting the vectors amount for RDMA > > > device? > > > > When RDMA core adds device, it calls to irdma_init_rdma_device() and > > num_comp_vectors is actually the number of MSI-X vectors which you want > > to give to that device. > > > > I'm trying to say that probably you shouldn't reserve specific vectors > > for both ethernet and RDMA and simply share same vectors. RDMA applications > > that care about performance set comp_vector through user space verbs anyway. > > > > Thanks for explanation, appriciate that. In our driver num_comp_vectors for > RDMA is set during driver probe. Do we have any interface to change > num_comp_vectors while driver is working? No, as this number is indirectly exposed to the user space. > > Sorry, I am not fully familiar with RDMA. Can user app for RDMA set > comp_vector to any value or only to max which is num_comp_vectors given > for RDMA while creating aux device? comp_vector logically equal to IRQ number and this is how RDMA applications controls to which interrupt deliver completions. " The CQ will use the completion vector comp_vector for signaling completion events; it must be at least zero and less than context->num_comp_vectors. " https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html > > Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to > 64, how I will know if I can or can't use these vectors in ethernet? Why should you need to know? Vectors are not exclusive and they can be used by many applications at the same time. The thing is that it is far fetch to except that high performance RDMA applications and high performance ethernet can coexist on same device at the same time. Thanks > > Thanks > > > Thanks > > > > > > > > Thanks > > > > > > > Thanks
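The man-page quote above, seen from user space: an application picks a completion vector per CQ, bounded by num_comp_vectors, and does not own the underlying interrupt exclusively. A minimal libibverbs fragment:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Spread CQs over the available completion vectors.  comp_vector selects
 * which interrupt delivers this CQ's events; it does not reserve the
 * underlying MSI-X vector exclusively. */
static struct ibv_cq *example_create_cq(struct ibv_context *ctx, int cqe,
					int flow_id)
{
	int comp_vector = flow_id % ctx->num_comp_vectors;

	return ibv_create_cq(ctx, cqe, NULL /* cq_context */,
			     NULL /* no completion channel */, comp_vector);
}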
On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote: > On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote: > > On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote: > > > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: > > > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > > > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > > > > > how many vectors should be used for various features. After implementing > > > > > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > > > > > for subfunctions. > > > > > > > > > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > > > > > commands: > > > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > > > > > num_cpus. > > > > > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > > > > > If yes, you have specific interface for that. > > > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > > > > > the PF driver. > > > > > > > > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > > > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > > > > > > > > > You shouldn't set any value through netdev as it will cause to > > > > > > > difference in output between lspci (which doesn't require any driver) > > > > > > > and your newly set number. > > > > > > > > > > > > If I understand correctly, lspci shows the MSI-X number for individual > > > > > > VF. Value set via devlink is the total number of MSI-X that can be used > > > > > > when creating VFs. > > > > > > > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > > > > > view. Driver can use less than that. It is exactly as your proposed > > > > > devlink interface. > > > > > > > > > > > > > > > > > > Ok, I have to take a closer look at it. So, are You saing that we should > > > > drop this devlink solution and use sysfs interface fo VFs or are You > > > > fine with having both? What with MSI-X allocation for subfunction? > > > > > > You should drop for VFs and PFs and keep it for SFs only. 
> > > > > > > I understand that MSI-X for VFs can be set via sysfs interface, but what > > with PFs? > > PFs are even more tricker than VFs, as you are changing that number > while driver is bound. This makes me wonder what will be lspci output, > as you will need to show right number before driver starts to load. > > You need to present right value if user decided to unbind driver from PF too. > In case of ice driver lspci -vs shows: Capabilities: [70] MSI-X: Enable+ Count=1024 Masked so all vectors that hw supports (PFs, VFs, misc, etc). Because of that total number of MSI-X in the devlink example from cover letter is 1024. I see that mellanox shows: Capabilities: [9c] MSI-X: Enable+ Count=64 Masked I assume that 64 is in this case MSI-X ony for this one PF (it make sense). To be honest I don't know why we show maximum MSI-X for the device there, but because of that the value will be the same afer changing allocation of MSI-X across features. Isn't the MSI-X capabilities read from HW register? > > Should we always allow max MSI-X for PFs? So hw_max - used - > > sfs? Is it save to call pci_enable_msix always with max vectors > > supported by device? > > I'm not sure. I think that it won't give you much if you enable > more than num_online_cpu(). > Oh, yes, correct, I missed that. > > > > I added the value for PFs, because we faced a problem with MSI-X > > allocation on 8 port device. Our default value (num_cpus) was too big > > (not enough vectors in hw). Changing the amount of vectors that can be > > used on PFs was solving the issue. > > We had something similar for mlx5 SFs, where we don't have enough vectors. > Our solution is simply to move to automatic shared MSI-X mode. I would > advocate for that for you as well. > Thanks for proposing solution, I will take a look how this work in mlx5. > > > > Let me write an example. As default MSI-X for PF is set to num_cpus, the > > platform have 128 CPUs, we have 8 port device installed there and still > > have 1024 vectors in HW (I simplified because I don't count additional > > interrupts). We run out of vectors, there is 0 vectors that can be used > > for VFs. Sure, it is easy to handle, we can divide PFs interrupts by 2 > > and will end with 512 vectors for VFs. I assume that with current sysfs > > interface in this situation MSI-X for VFs can be set from 0 to 512? What > > if user wants more? If there is a PFs MSI-X value which can be set by > > user, user can decrease the value and use more vectors for VFs. Is it > > possible in current VFs sysfs interface? I mean, setting VFs MSI-X > > vectors to value that will need to decrease MSI-X for PFs. > > You can't do it and this limitation is because PF is bound. You can't change > that number while driver is running. AFAIR, such change will be PCI spec > violation. > As in previous comment, we have different value in MSI-X capabilities field (max number of vectors), so I won't allow user to change this value. I don't know why it is done like that (MSI-X amount for whole device in PCI caps). I will try to dig into it. > > > > > > > > > > > > As Jake said I will fix the code to track both values. Thanks for pointing the patch. > > > > > > > > > > > > > > > > > > > > Also in RDMA case, it is not clear what will you achieve by this > > > > > > > setting too. > > > > > > > > > > > > > > > > > > > We have limited number of MSI-X (1024) in the device. Because of that > > > > > > the amount of MSI-X for each feature is set to the best values. Half for > > > > > > ethernet, half for RDMA. 
This patchset allow user to change this values. > > > > > > If he wants more MSI-X for ethernet, he can decrease MSI-X for RDMA. > > > > > > > > > > RDMA devices doesn't have PCI logic and everything is controlled through > > > > > you main core module. It means that when you create RDMA auxiliary device, > > > > > it will be connected to netdev (RoCE and iWARP) and that netdev should > > > > > deal with vectors. So I still don't understand what does it mean "half > > > > > for RDMA". > > > > > > > > > > > > > Yes, it is controlled by module, but during probe, MSI-X vectors for RDMA > > > > are reserved and can't be used by ethernet. For example I have > > > > 64 CPUs, when loading I get 64 vectors from HW for ethernet and 64 for > > > > RDMA. The vectors for RDMA will be consumed by irdma driver, so I won't > > > > be able to use it in ethernet and vice versa. > > > > > > > > By saing it can't be used I mean that irdma driver received the MSI-X > > > > vectors number and it is using them (connected them with RDMA interrupts). > > > > > > > > Devlink resource is a way to change the number of MSI-X vectors that > > > > will be reserved for RDMA. You wrote that netdev should deal with > > > > vectors, but how netdev will know how many vectors should go to RDMA aux > > > > device? Does there an interface for setting the vectors amount for RDMA > > > > device? > > > > > > When RDMA core adds device, it calls to irdma_init_rdma_device() and > > > num_comp_vectors is actually the number of MSI-X vectors which you want > > > to give to that device. > > > > > > I'm trying to say that probably you shouldn't reserve specific vectors > > > for both ethernet and RDMA and simply share same vectors. RDMA applications > > > that care about performance set comp_vector through user space verbs anyway. > > > > > > > Thanks for explanation, appriciate that. In our driver num_comp_vectors for > > RDMA is set during driver probe. Do we have any interface to change > > num_comp_vectors while driver is working? > > No, as this number is indirectly exposed to the user space. > Ok, thanks. > > > > Sorry, I am not fully familiar with RDMA. Can user app for RDMA set > > comp_vector to any value or only to max which is num_comp_vectors given > > for RDMA while creating aux device? > > comp_vector logically equal to IRQ number and this is how RDMA > applications controls to which interrupt deliver completions. > > " The CQ will use the completion vector comp_vector for signaling > completion events; it must be at least zero and less than > context->num_comp_vectors. " > https://man7.org/linux/man-pages/man3/ibv_create_cq.3.html > Thank You > > > > Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to > > 64, how I will know if I can or can't use these vectors in ethernet? > > Why should you need to know? Vectors are not exclusive and they can be > used by many applications at the same time. The thing is that it is far > fetch to except that high performance RDMA applications and high > performance ethernet can coexist on same device at the same time. > Yes, but after loading aux device part of vectors (num_comp_vectors) are reserved for only RDMA use case (at least in ice driver). We though that devlink resource interface can be a good way to change the num_comp_vectors in this case. Maybe I am wrong, but I think that vectors that we reserved for RDMA can't be used by ethernet (in ice driver). 
We sent specific MSI-X entries to the rdma driver, so I don't think that the same entries can be reused by ethernet. I am talking only about the ice + irdma case, am I wrong (probably a question to Intel devs)? Huge thanks for your comments here, I understand more now. > Thanks > > > > > Thanks > > > > > Thanks > > > > > > > > > > > Thanks > > > > > > > > > Thanks
On Tue, Nov 15, 2022 at 03:02:40PM +0100, Michal Swiatkowski wrote: > On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote: > > On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote: > > > On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote: > > > > On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: > > > > > On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: > > > > > > On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: > > > > > > > On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: > > > > > > > > On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: > > > > > > > > > On 11/14/2022 7:23 AM, Leon Romanovsky wrote: > > > > > > > > > > On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: > > > > > > > > > > > Currently the default value for number of PF vectors is number of CPUs. > > > > > > > > > > > Because of that there are cases when all vectors are used for PF > > > > > > > > > > > and user can't create more VFs. It is hard to set default number of > > > > > > > > > > > CPUs right for all different use cases. Instead allow user to choose > > > > > > > > > > > how many vectors should be used for various features. After implementing > > > > > > > > > > > subdevices this mechanism will be also used to set number of vectors > > > > > > > > > > > for subfunctions. > > > > > > > > > > > > > > > > > > > > > > The idea is to set vectors for eth or VFs using devlink resource API. > > > > > > > > > > > New value of vectors will be used after devlink reinit. Example > > > > > > > > > > > commands: > > > > > > > > > > > $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 > > > > > > > > > > > $ sudo devlink dev reload pci/0000:31:00.0 > > > > > > > > > > > After reload driver will work with 16 vectors used for eth instead of > > > > > > > > > > > num_cpus. > > > > > > > > > > By saying "vectors", are you referring to MSI-X vectors? > > > > > > > > > > If yes, you have specific interface for that. > > > > > > > > > > https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ > > > > > > > > > > > > > > > > > > This patch series is exposing a resources API to split the device level MSI-X vectors > > > > > > > > > across the different functions supported by the device (PF, RDMA, SR-IOV VFs and > > > > > > > > > in future subfunctions). Today this is all hidden in a policy implemented within > > > > > > > > > the PF driver. > > > > > > > > > > > > > > > > Maybe we are talking about different VFs, but if you refer to PCI VFs, > > > > > > > > the amount of MSI-X comes from PCI config space for that specific VF. > > > > > > > > > > > > > > > > You shouldn't set any value through netdev as it will cause to > > > > > > > > difference in output between lspci (which doesn't require any driver) > > > > > > > > and your newly set number. > > > > > > > > > > > > > > If I understand correctly, lspci shows the MSI-X number for individual > > > > > > > VF. Value set via devlink is the total number of MSI-X that can be used > > > > > > > when creating VFs. > > > > > > > > > > > > Yes and no, lspci shows how much MSI-X vectors exist from HW point of > > > > > > view. Driver can use less than that. It is exactly as your proposed > > > > > > devlink interface. > > > > > > > > > > > > > > > > > > > > > > Ok, I have to take a closer look at it. 
So, are You saing that we should > > > > > drop this devlink solution and use sysfs interface fo VFs or are You > > > > > fine with having both? What with MSI-X allocation for subfunction? > > > > > > > > You should drop for VFs and PFs and keep it for SFs only. > > > > > > > > > > I understand that MSI-X for VFs can be set via sysfs interface, but what > > > with PFs? > > > > PFs are even more tricker than VFs, as you are changing that number > > while driver is bound. This makes me wonder what will be lspci output, > > as you will need to show right number before driver starts to load. > > > > You need to present right value if user decided to unbind driver from PF too. > > > > In case of ice driver lspci -vs shows: > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > total number of MSI-X in the devlink example from cover letter is 1024. > > I see that mellanox shows: > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > I assume that 64 is in this case MSI-X ony for this one PF (it make > sense). Yes and PF MSI-X count can be changed through FW configuration tool, as we need to write new value when the driver is unbound and we need it to be persistent. Users are expecting to see "stable" number any time they reboot the server. It is not the case for VFs, as they are explicitly created after reboots and start "fresh" after every boot. So we set large enough but not too large value as a default for PFs. If you find sane model of how to change it through kernel, you can count on our support. > To be honest I don't know why we show maximum MSI-X for the device > there, but because of that the value will be the same afer changing > allocation of MSI-X across features. > > Isn't the MSI-X capabilities read from HW register? Yes, it comes from Message Control register. > > > > Should we always allow max MSI-X for PFs? So hw_max - used - > > > sfs? Is it save to call pci_enable_msix always with max vectors > > > supported by device? > > > > I'm not sure. I think that it won't give you much if you enable > > more than num_online_cpu(). > > > > Oh, yes, correct, I missed that. > > > > > > > I added the value for PFs, because we faced a problem with MSI-X > > > allocation on 8 port device. Our default value (num_cpus) was too big > > > (not enough vectors in hw). Changing the amount of vectors that can be > > > used on PFs was solving the issue. > > > > We had something similar for mlx5 SFs, where we don't have enough vectors. > > Our solution is simply to move to automatic shared MSI-X mode. I would > > advocate for that for you as well. > > > > Thanks for proposing solution, I will take a look how this work in mlx5. BTW, PCI spec talks about something like that in short paragraph "Handling MSI-X Vector Shortages". <...> > > > > > > Assuming that I gave 64 MSI-X for RDMA by setting num_comp_vectors to > > > 64, how I will know if I can or can't use these vectors in ethernet? > > > > Why should you need to know? Vectors are not exclusive and they can be > > used by many applications at the same time. The thing is that it is far > > fetch to except that high performance RDMA applications and high > > performance ethernet can coexist on same device at the same time. > > > > Yes, but after loading aux device part of vectors (num_comp_vectors) are > reserved for only RDMA use case (at least in ice driver). We though that > devlink resource interface can be a good way to change the > num_comp_vectors in this case. 
It can be, but it is much better to spare users from devlink magic. :) Thanks
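For reference, the per-VF sysfs interface referenced earlier in the thread works roughly as sketched below; the sysfs file names follow the PCI/IOV MSI-X assignment ABI as I understand it, and the BDFs and the value 32 are purely illustrative.

# pool of MSI-X vectors the PF makes available for assignment to its VFs
$ cat /sys/bus/pci/devices/0000:31:00.0/sriov_vf_total_msix
# change one VF's MSI-X count; the VF must not be bound to a driver
$ echo 0000:31:01.0 > /sys/bus/pci/devices/0000:31:01.0/driver/unbind
$ echo 32 > /sys/bus/pci/devices/0000:31:01.0/sriov_vf_msix_count
$ lspci -s 31:01.0 -vv | grep MSI-X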
On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > On Tue, Nov 15, 2022 at 03:02:40PM +0100, Michal Swiatkowski wrote: >> On Tue, Nov 15, 2022 at 02:12:12PM +0200, Leon Romanovsky wrote: >>> On Tue, Nov 15, 2022 at 11:16:58AM +0100, Michal Swiatkowski wrote: >>>> On Tue, Nov 15, 2022 at 11:32:14AM +0200, Leon Romanovsky wrote: >>>>> On Tue, Nov 15, 2022 at 10:04:49AM +0100, Michal Swiatkowski wrote: >>>>>> On Tue, Nov 15, 2022 at 10:11:10AM +0200, Leon Romanovsky wrote: >>>>>>> On Tue, Nov 15, 2022 at 08:12:52AM +0100, Michal Swiatkowski wrote: >>>>>>>> On Mon, Nov 14, 2022 at 07:07:54PM +0200, Leon Romanovsky wrote: >>>>>>>>> On Mon, Nov 14, 2022 at 09:31:11AM -0600, Samudrala, Sridhar wrote: >>>>>>>>>> On 11/14/2022 7:23 AM, Leon Romanovsky wrote: >>>>>>>>>>> On Mon, Nov 14, 2022 at 01:57:42PM +0100, Michal Swiatkowski wrote: >>>>>>>>>>>> Currently the default value for number of PF vectors is number of CPUs. >>>>>>>>>>>> Because of that there are cases when all vectors are used for PF >>>>>>>>>>>> and user can't create more VFs. It is hard to set default number of >>>>>>>>>>>> CPUs right for all different use cases. Instead allow user to choose >>>>>>>>>>>> how many vectors should be used for various features. After implementing >>>>>>>>>>>> subdevices this mechanism will be also used to set number of vectors >>>>>>>>>>>> for subfunctions. >>>>>>>>>>>> >>>>>>>>>>>> The idea is to set vectors for eth or VFs using devlink resource API. >>>>>>>>>>>> New value of vectors will be used after devlink reinit. Example >>>>>>>>>>>> commands: >>>>>>>>>>>> $ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 16 >>>>>>>>>>>> $ sudo devlink dev reload pci/0000:31:00.0 >>>>>>>>>>>> After reload driver will work with 16 vectors used for eth instead of >>>>>>>>>>>> num_cpus. >>>>>>>>>>> By saying "vectors", are you referring to MSI-X vectors? >>>>>>>>>>> If yes, you have specific interface for that. >>>>>>>>>>> https://lore.kernel.org/linux-pci/20210314124256.70253-1-leon@kernel.org/ >>>>>>>>>> This patch series is exposing a resources API to split the device level MSI-X vectors >>>>>>>>>> across the different functions supported by the device (PF, RDMA, SR-IOV VFs and >>>>>>>>>> in future subfunctions). Today this is all hidden in a policy implemented within >>>>>>>>>> the PF driver. >>>>>>>>> Maybe we are talking about different VFs, but if you refer to PCI VFs, >>>>>>>>> the amount of MSI-X comes from PCI config space for that specific VF. >>>>>>>>> >>>>>>>>> You shouldn't set any value through netdev as it will cause to >>>>>>>>> difference in output between lspci (which doesn't require any driver) >>>>>>>>> and your newly set number. >>>>>>>> If I understand correctly, lspci shows the MSI-X number for individual >>>>>>>> VF. Value set via devlink is the total number of MSI-X that can be used >>>>>>>> when creating VFs. >>>>>>> Yes and no, lspci shows how much MSI-X vectors exist from HW point of >>>>>>> view. Driver can use less than that. It is exactly as your proposed >>>>>>> devlink interface. >>>>>>> >>>>>>> >>>>>> Ok, I have to take a closer look at it. So, are You saing that we should >>>>>> drop this devlink solution and use sysfs interface fo VFs or are You >>>>>> fine with having both? What with MSI-X allocation for subfunction? >>>>> You should drop for VFs and PFs and keep it for SFs only. >>>>> >>>> I understand that MSI-X for VFs can be set via sysfs interface, but what >>>> with PFs? 
>>> PFs are even more tricker than VFs, as you are changing that number >>> while driver is bound. This makes me wonder what will be lspci output, >>> as you will need to show right number before driver starts to load. >>> >>> You need to present right value if user decided to unbind driver from PF too. >>> >> In case of ice driver lspci -vs shows: >> Capabilities: [70] MSI-X: Enable+ Count=1024 Masked >> >> so all vectors that hw supports (PFs, VFs, misc, etc). Because of that >> total number of MSI-X in the devlink example from cover letter is 1024. >> >> I see that mellanox shows: >> Capabilities: [9c] MSI-X: Enable+ Count=64 Masked >> >> I assume that 64 is in this case MSI-X ony for this one PF (it make >> sense). > Yes and PF MSI-X count can be changed through FW configuration tool, as > we need to write new value when the driver is unbound and we need it to > be persistent. Users are expecting to see "stable" number any time they > reboot the server. It is not the case for VFs, as they are explicitly > created after reboots and start "fresh" after every boot. > > So we set large enough but not too large value as a default for PFs. > If you find sane model of how to change it through kernel, you can count > on our support. I guess one main difference is that in the case of ice, the PF driver manages resources for all its associated functions, not the FW. So the MSI-X count reported for the PF shows the total vectors (PF netdev, VFs, rdma, SFs). VFs talk to the PF over a mailbox to get their MSI-X vector information.
On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: <...> > > > In case of ice driver lspci -vs shows: > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > I see that mellanox shows: > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > sense). > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > we need to write new value when the driver is unbound and we need it to > > be persistent. Users are expecting to see "stable" number any time they > > reboot the server. It is not the case for VFs, as they are explicitly > > created after reboots and start "fresh" after every boot. > > > > So we set large enough but not too large value as a default for PFs. > > If you find sane model of how to change it through kernel, you can count > > on our support. > > I guess one main difference is that in case of ice, PF driver manager resources > for all its associated functions, not the FW. So the MSI-X count reported for PF > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > to get their MSI-X vector information. What is the output of lspci for ice VF when the driver is not bound? Thanks > > >
On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > > <...> > > > > > In case of ice driver lspci -vs shows: > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > > > I see that mellanox shows: > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > > sense). > > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > > we need to write new value when the driver is unbound and we need it to > > > be persistent. Users are expecting to see "stable" number any time they > > > reboot the server. It is not the case for VFs, as they are explicitly > > > created after reboots and start "fresh" after every boot. > > > > > > So we set large enough but not too large value as a default for PFs. > > > If you find sane model of how to change it through kernel, you can count > > > on our support. > > > > I guess one main difference is that in case of ice, PF driver manager resources > > for all its associated functions, not the FW. So the MSI-X count reported for PF > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > > to get their MSI-X vector information. > > What is the output of lspci for ice VF when the driver is not bound? > > Thanks > It is the same after creating and after unbonding: Capabilities: [70] MSI-X: Enable- Count=17 Masked- 17, because 16 for traffic and 1 for mailbox. Thanks > > > > > >
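One way to reproduce that check, assuming the VF is bound to iavf and using an illustrative BDF:

$ echo 0000:31:01.0 > /sys/bus/pci/drivers/iavf/unbind
$ lspci -s 31:01.0 -vv | grep MSI-X
        Capabilities: [70] MSI-X: Enable- Count=17 Masked-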
On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote: > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > > > > <...> > > > > > > > In case of ice driver lspci -vs shows: > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > > > > > I see that mellanox shows: > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > > > sense). > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > > > we need to write new value when the driver is unbound and we need it to > > > > be persistent. Users are expecting to see "stable" number any time they > > > > reboot the server. It is not the case for VFs, as they are explicitly > > > > created after reboots and start "fresh" after every boot. > > > > > > > > So we set large enough but not too large value as a default for PFs. > > > > If you find sane model of how to change it through kernel, you can count > > > > on our support. > > > > > > I guess one main difference is that in case of ice, PF driver manager resources > > > for all its associated functions, not the FW. So the MSI-X count reported for PF > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > > > to get their MSI-X vector information. > > > > What is the output of lspci for ice VF when the driver is not bound? > > > > Thanks > > > > It is the same after creating and after unbonding: > Capabilities: [70] MSI-X: Enable- Count=17 Masked- > > 17, because 16 for traffic and 1 for mailbox. Interesting, I think that your PF violates PCI spec as it always uses word "function" and not "device" while talks about MSI-X related registers. Thanks > > Thanks > > > > > > > > >
On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote: > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote: > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > > > > > > <...> > > > > > > > > > In case of ice driver lspci -vs shows: > > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > > > > > > > I see that mellanox shows: > > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > > > > sense). > > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > > > > we need to write new value when the driver is unbound and we need it to > > > > > be persistent. Users are expecting to see "stable" number any time they > > > > > reboot the server. It is not the case for VFs, as they are explicitly > > > > > created after reboots and start "fresh" after every boot. > > > > > > > > > > So we set large enough but not too large value as a default for PFs. > > > > > If you find sane model of how to change it through kernel, you can count > > > > > on our support. > > > > I guess one main difference is that in case of ice, PF driver manager resources > > > > for all its associated functions, not the FW. So the MSI-X count reported for PF > > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > > > > to get their MSI-X vector information. > > > What is the output of lspci for ice VF when the driver is not bound? > > > Thanks > > It is the same after creating and after unbonding: > > Capabilities: [70] MSI-X: Enable- Count=17 Masked- > > 17, because 16 for traffic and 1 for mailbox. > Interesting, I think that your PF violates PCI spec as it always > uses word "function" and not "device" while talks about MSI-X related > registers. > Thanks >

I made a mistake in one comment. 1024 isn't the MSI-X amount for the device. On ice we have 2048 for the whole device. On a two-port card each PF has 1024 MSI-X. Our control register mapping to the internal space looks like this (assuming a two-port card; one VF with 5 MSI-X created):

INT[PF0].FIRST     0
                   1
                   2
                   .
                   .
                   .
                1019   INT[VF0].FIRST __
                1020                     |  interrupts used
                1021                     |  by VF on PF0
                1022                     |
INT[PF0].LAST   1023   INT[VF0].LAST   __|
INT[PF1].FIRST  1024
                1025
                1026
                   .
                   .
                   .
INT[PF1].LAST   2047

The MSI-X entry table size for PF0 is 1024, but the entry table for the VF is a part of PF0's physical space.

Do You mean that "sharing" the entries between PF and VF is a violation of the PCI spec? That the sum of the MSI-X counts of all functions within a device shouldn't be greater than 2048? It is hard to find anything about this in the spec. I only read that: "MSI-X supports a maximum table size of 2048 entries". I will continue searching for information about that.

I don't think that from the driver perspective we can change the table size located in the Message Control register.

I assume that in mlnx the tool that You mentioned can modify this table size?

Thanks
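For reference, the table size being discussed is the read-only Table Size field (bits 10:0, encoded as N-1) of the MSI-X Message Control register, which can be read from config space without any driver; the BDF and the value shown below are illustrative:

$ setpci -s 31:00.0 CAP_MSIX+2.w
03ff
# bits 10:0 hold Table Size encoded as N-1: 0x3ff + 1 = 1024 entries for PF0
# (bits 15 and 14 are the MSI-X Enable and Function Mask flags)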
On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote: > On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote: > > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote: > > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > > > > > > > > <...> > > > > > > > > > > > In case of ice driver lspci -vs shows: > > > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > > > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > > > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > > > > > > > > > I see that mellanox shows: > > > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > > > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > > > > > sense). > > > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > > > > > we need to write new value when the driver is unbound and we need it to > > > > > > be persistent. Users are expecting to see "stable" number any time they > > > > > > reboot the server. It is not the case for VFs, as they are explicitly > > > > > > created after reboots and start "fresh" after every boot. > > > > > > > > > > > > So we set large enough but not too large value as a default for PFs. > > > > > > If you find sane model of how to change it through kernel, you can count > > > > > > on our support. > > > > > > > > > > I guess one main difference is that in case of ice, PF driver manager resources > > > > > for all its associated functions, not the FW. So the MSI-X count reported for PF > > > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > > > > > to get their MSI-X vector information. > > > > > > > > What is the output of lspci for ice VF when the driver is not bound? > > > > > > > > Thanks > > > > > > > > > > It is the same after creating and after unbonding: > > > Capabilities: [70] MSI-X: Enable- Count=17 Masked- > > > > > > 17, because 16 for traffic and 1 for mailbox. > > > > Interesting, I think that your PF violates PCI spec as it always > > uses word "function" and not "device" while talks about MSI-X related > > registers. > > > > Thanks > > > > I made mistake in one comment. 1024 isn't MSI-X amount for device. On > ice we have 2048 for the whole device. On two ports card each PF have > 1024 MSI-X. Our control register mapping to the internal space looks > like that (Assuming two port card; one VF with 5 MSI-X created): > INT[PF0].FIRST 0 > 1 > 2 > > . > . > . > > 1019 INT[VF0].FIRST __ > 1020 | interrupts used > 1021 | by VF on PF0 > 1022 | > INT[PF0].LAST 1023 INT[VF0].LAST __| > INT[PF1].FIRST 1024 > 1025 > 1026 > > . > . > . > > INT[PF1].LAST 2047 > > MSI-X entry table size for PF0 is 1024, but entry table for VF is a part > of PF0 physical space. > > Do You mean that "sharing" the entry between PF and VF is a violation of > PCI spec? You should consult with your PCI specification experts. It was my spec interpretation, which can be wrong. > Sum of MSI-X count on all function within device shouldn't be > grater than 2048? No, it is 2K per-control message/per-function. > It is hard to find sth about this in spec. I only read > that: "MSI-X supports a maximum table size of 2048 entries". I will > continue searching for information about that. 
> > I don't think that from driver perspective we can change the table size > located in message control register. No, you can't, unless you decide to explicitly violate the spec. > > I assume in mlnx the tool that You mentioned can modify this table size? Yes, it is the FW configuration tool. Thanks
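As an illustration of the FW-configuration-tool approach on mlx5 devices, the flow is roughly the one below; the exact mlxconfig parameter name and device argument are assumptions here, not something confirmed in this thread.

$ mlxconfig -d 0000:31:00.0 query | grep MSIX      # BDF is illustrative
$ mlxconfig -d 0000:31:00.0 set NUM_PF_MSIX=63     # parameter name is an assumption
# the new value is stored in NVRAM and shows up in lspci after a FW reset/reboot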
On Thu, Nov 17, 2022 at 01:45:38PM +0200, Leon Romanovsky wrote: > On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote: > > On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote: > > > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote: > > > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > > > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: > > > > > > > > > > <...> > > > > > > > > > > > > > In case of ice driver lspci -vs shows: > > > > > > > > Capabilities: [70] MSI-X: Enable+ Count=1024 Masked > > > > > > > > > > > > > > > > so all vectors that hw supports (PFs, VFs, misc, etc). Because of that > > > > > > > > total number of MSI-X in the devlink example from cover letter is 1024. > > > > > > > > > > > > > > > > I see that mellanox shows: > > > > > > > > Capabilities: [9c] MSI-X: Enable+ Count=64 Masked > > > > > > > > > > > > > > > > I assume that 64 is in this case MSI-X ony for this one PF (it make > > > > > > > > sense). > > > > > > > Yes and PF MSI-X count can be changed through FW configuration tool, as > > > > > > > we need to write new value when the driver is unbound and we need it to > > > > > > > be persistent. Users are expecting to see "stable" number any time they > > > > > > > reboot the server. It is not the case for VFs, as they are explicitly > > > > > > > created after reboots and start "fresh" after every boot. > > > > > > > > > > > > > > So we set large enough but not too large value as a default for PFs. > > > > > > > If you find sane model of how to change it through kernel, you can count > > > > > > > on our support. > > > > > > > > > > > > I guess one main difference is that in case of ice, PF driver manager resources > > > > > > for all its associated functions, not the FW. So the MSI-X count reported for PF > > > > > > shows the total vectors(PF netdev, VFs, rdma, SFs). VFs talk to PF over a mailbox > > > > > > to get their MSI-X vector information. > > > > > > > > > > What is the output of lspci for ice VF when the driver is not bound? > > > > > > > > > > Thanks > > > > > > > > > > > > > It is the same after creating and after unbonding: > > > > Capabilities: [70] MSI-X: Enable- Count=17 Masked- > > > > > > > > 17, because 16 for traffic and 1 for mailbox. > > > > > > Interesting, I think that your PF violates PCI spec as it always > > > uses word "function" and not "device" while talks about MSI-X related > > > registers. > > > > > > Thanks > > > > > > > I made mistake in one comment. 1024 isn't MSI-X amount for device. On > > ice we have 2048 for the whole device. On two ports card each PF have > > 1024 MSI-X. Our control register mapping to the internal space looks > > like that (Assuming two port card; one VF with 5 MSI-X created): > > INT[PF0].FIRST 0 > > 1 > > 2 > > > > . > > . > > . > > > > 1019 INT[VF0].FIRST __ > > 1020 | interrupts used > > 1021 | by VF on PF0 > > 1022 | > > INT[PF0].LAST 1023 INT[VF0].LAST __| > > INT[PF1].FIRST 1024 > > 1025 > > 1026 > > > > . > > . > > . > > > > INT[PF1].LAST 2047 > > > > MSI-X entry table size for PF0 is 1024, but entry table for VF is a part > > of PF0 physical space. > > > > Do You mean that "sharing" the entry between PF and VF is a violation of > > PCI spec? > > You should consult with your PCI specification experts. It was my > spec interpretation, which can be wrong. 
> Sure, I will > > > Sum of MSI-X count on all function within device shouldn't be > > grater than 2048? > > No, it is 2K per-control message/per-function. > > > > It is hard to find sth about this in spec. I only read > > that: "MSI-X supports a maximum table size of 2048 entries". I will > > continue searching for information about that. > > > > I don't think that from driver perspective we can change the table size > > located in message control register. > > No, you can't, unless you decide explicitly violate spec. > > > > > I assume in mlnx the tool that You mentioned can modify this table size? > > Yes, it is FW configuration tool. > Thanks for clarification. In summary, if this really is a violation of the PCI spec we will for sure have to fix that, and the resource management implemented in this patchset will not make sense (as the config for the PF MSI-X amount will have to happen in FW). If it isn't, is it still a NACK from You? I mean, if we implement the devlink resource management (maybe not a generic one, only defined in ice) and sysfs for setting VF MSI-X, will You be ok with that? Or is using devlink to manage MSI-X resources not a good solution in general? Our motivation here is to allow the user to configure the MSI-X allocation for his own use case. For example, giving only 1 MSI-X to ethernet and RDMA and spending the rest on VFs. If I understand correctly, on a mlnx card the user can set the PF MSI-X count to 1 in the FW configuration tool and achieve the same. Currently we don't have any other mechanism to change the default number of MSI-X used for ethernet.
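Under the proposed interface, that example split would look something like the commands below; only msix/msix_eth appears in the cover letter, so the msix_rdma path, the VF count and the netdev name are assumptions.

$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_eth size 1
$ sudo devlink resource set pci/0000:31:00.0 path msix/msix_rdma size 1    # assumed resource name
$ sudo devlink dev reload pci/0000:31:00.0
$ echo 128 > /sys/class/net/ens785f0/device/sriov_numvfs                   # illustrative netdev name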
On Thu, Nov 17, 2022 at 02:39:50PM +0100, Michal Swiatkowski wrote: > On Thu, Nov 17, 2022 at 01:45:38PM +0200, Leon Romanovsky wrote: > > On Thu, Nov 17, 2022 at 12:10:21PM +0100, Michal Swiatkowski wrote: > > > On Wed, Nov 16, 2022 at 07:59:43PM +0200, Leon Romanovsky wrote: > > > > On Wed, Nov 16, 2022 at 01:04:36PM +0100, Michal Swiatkowski wrote: > > > > > On Wed, Nov 16, 2022 at 08:04:56AM +0200, Leon Romanovsky wrote: > > > > > > On Tue, Nov 15, 2022 at 07:59:06PM -0600, Samudrala, Sridhar wrote: > > > > > > > On 11/15/2022 11:57 AM, Leon Romanovsky wrote: <...> > > Thanks for clarification. > In summary, if this is really violation of PCI spec we for sure will > have to fix that and resource managment implemented in this patchset > will not make sense (as config for PF MSI-X amount will have to happen > in FW). > > If it isn't, is it still NACK from You? I mean, if we implement the > devlink resources managment (maybe not generic one, only define in ice) > and sysfs for setting VF MSI-X, will You be ok with that? Or using > devlink to manage MSI-X resources isn't in general good solution? The NACK/ACK question is not for me, it is not my decision :) I don't think that management of PCI specific parameters in devlink is the right idea. PCI has its own subsystem with rules and assumptions, netdev shouldn't mangle them. In the MSI-X case, this is even more troublesome as users see these values through lspci without a driver attached to it. For example (a very popular case), a user creates VFs, unbinds the driver from all functions (PF and VFs) and passes them to VMs (binds them to the vfio driver). Your solution, being in the wrong layer, won't work there. Thanks
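The use case described above is roughly the following sequence, where the MSI-X count the VM sees has to be correct while no vendor driver is bound; BDFs and the netdev name are illustrative.

$ echo 2 > /sys/class/net/ens785f0/device/sriov_numvfs
$ echo 0000:31:01.0 > /sys/bus/pci/devices/0000:31:01.0/driver/unbind
$ echo vfio-pci > /sys/bus/pci/devices/0000:31:01.0/driver_override
$ echo 0000:31:01.0 > /sys/bus/pci/drivers_probe
$ lspci -s 31:01.0 -vv | grep MSI-X    # this is the count the guest will get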
On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote: > I don't think that management of PCI specific parameters in devlink is > right idea. PCI has his own subsystem with rules and assumptions, netdev > shouldn't mangle them. Not netdev, devlink, which covers netdev, RDMA and others. > In MSI-X case, this even more troublesome as users > sees these values through lspci without driver attached to it. I'm no PCI expert either but FWIW to me the patch set seems reasonable. Whether the management FW's policies are configured via core PCI sysfs or subsystem-specific-ish APIs is a toss-up. Adding Bjorn, and please CC him on the next rev. Link to the series: https://lore.kernel.org/all/20221114125755.13659-1-michal.swiatkowski@linux.intel.com/
On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote: > On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote: > > I don't think that management of PCI specific parameters in devlink is > > right idea. PCI has his own subsystem with rules and assumptions, netdev > > shouldn't mangle them. > > Not netdev, devlink, which covers netdev, RDMA and others. devlink is located in net/*, so it is netdev. Regarding RDMA, that is not fully accurate. We use devlink to convey information to the FW through the PCI device behind the netdev. Some of those parameters are RDMA related. However, we don't configure RDMA properties through devlink; there is a special tool for that (rdmatool). > > > In MSI-X case, this even more troublesome as users > > sees these values through lspci without driver attached to it. > > I'm no PCI expert either but FWIW to me the patch set seems reasonable. > Whether management FW policies are configured via core PCI sysfs or > subsystem-specific-ish APIs is a toss up. Adding Bjorn and please CC him > on the next rev. The thing is that it must be managed by the FW. Earlier in this thread, I pointed out a pretty normal use case where you need to have a valid PCI structure without any driver involvement. > > Link to the series: > https://lore.kernel.org/all/20221114125755.13659-1-michal.swiatkowski@linux.intel.com/
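For context, rdmatool operates on an already-registered ib_device, for example (device names and output are illustrative):

$ rdma link show
link irdma0/1 state ACTIVE physical_state LINK_UP netdev ens785f0
$ rdma resource show qp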
> Subject: Re: [PATCH net-next 00/13] resource management using devlink reload > > On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote: > > On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote: > > > I don't think that management of PCI specific parameters in devlink > > > is right idea. PCI has his own subsystem with rules and assumptions, > > > netdev shouldn't mangle them. > > > > Not netdev, devlink, which covers netdev, RDMA and others. > > devlink is located in net/*, it is netdev. > Regarding RDMA, it is not fully accurate. We use devlink to convey information to > FW through pci device located in netdev. Some of such parameters are RDMA > related. However, we don't configure RDMA properties through devlink, there is a > special tool for that (rdmatool). rdmatool, though, is usable only once the rdma driver probe() completes and the ib_device is registered. It cannot be used for any configuration at driver init time.

Don't we already have PCI specific parameters managed through devlink today?

https://docs.kernel.org/networking/devlink/devlink-params.html
enable_sriov
ignore_ari
msix_vec_per_pf_max
msix_vec_per_pf_min

How are these in a different bracket from what Michal is trying to do? Or were these ones a bad idea in hindsight?

Shiraz
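For comparison, those existing generic devlink params are driven like this today; the value and cmode below are illustrative (drivers such as bnxt_en expose them as permanent/NVM config, as far as I can tell).

$ devlink dev param show pci/0000:31:00.0 name msix_vec_per_pf_max
$ devlink dev param set pci/0000:31:00.0 name msix_vec_per_pf_max value 128 cmode permanent
# a permanent cmode value only takes effect once the NVM config is applied (e.g. after reboot)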
On Fri, Nov 18, 2022 at 02:23:50PM +0000, Saleem, Shiraz wrote: > > Subject: Re: [PATCH net-next 00/13] resource management using devlink reload > > > > On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote: > > > On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote: > > > > I don't think that management of PCI specific parameters in devlink > > > > is right idea. PCI has his own subsystem with rules and assumptions, > > > > netdev shouldn't mangle them. > > > > > > Not netdev, devlink, which covers netdev, RDMA and others. > > > > devlink is located in net/*, it is netdev. > > Regarding RDMA, it is not fully accurate. We use devlink to convey information to > > FW through pci device located in netdev. Some of such parameters are RDMA > > related. However, we don't configure RDMA properties through devlink, there is a > > special tool for that (rdmatool). > > rdmatool though is usable only once the rdma driver probe() completes and the ib_device is registered. > And cannot be used for any configurations at driver init time. Like I said, we use devlink to configure FW and "core" device to which ib_device is connected. We don't configure RDMA specific properties, but only device specific ones. > > Don't we already have PCI specific parameters managed through devlink today? > > https://docs.kernel.org/networking/devlink/devlink-params.html > enable_sriov > ignore_ari > msix_vec_per_pf_max > msix_vec_per_pf_min > > How are these in a different bracket from what Michal is trying to do? Or were these ones a bad idea in hindsight? Yes as it doesn't belong to net/* and it exists there just because of one reason: ability to write to FW of specific device. At least for ARI, I don't see that bnxt driver masked ARI Extended Capability and informed PCI subsystem about it, so PCI core will recreate device. Thanks
On 11/18/2022 11:31 AM, Leon Romanovsky wrote: > On Fri, Nov 18, 2022 at 02:23:50PM +0000, Saleem, Shiraz wrote: >>> Subject: Re: [PATCH net-next 00/13] resource management using devlink reload >>> >>> On Thu, Nov 17, 2022 at 07:36:18PM -0800, Jakub Kicinski wrote: >>>> On Thu, 17 Nov 2022 19:38:48 +0200 Leon Romanovsky wrote: >>>>> I don't think that management of PCI specific parameters in devlink >>>>> is right idea. PCI has his own subsystem with rules and assumptions, >>>>> netdev shouldn't mangle them. >>>> Not netdev, devlink, which covers netdev, RDMA and others. >>> devlink is located in net/*, it is netdev. >>> Regarding RDMA, it is not fully accurate. We use devlink to convey information to >>> FW through pci device located in netdev. Some of such parameters are RDMA >>> related. However, we don't configure RDMA properties through devlink, there is a >>> special tool for that (rdmatool). >> rdmatool though is usable only once the rdma driver probe() completes and the ib_device is registered. >> And cannot be used for any configurations at driver init time. > Like I said, we use devlink to configure FW and "core" device to which > ib_device is connected. We don't configure RDMA specific properties, but > only device specific ones. I don't think this patch series is configuring any PCI specific parameters OR per-PCI device parameters. The FW splits the device level MSI-X vectors across its PFs(1 per port). Each PF manages its own MSI-X vectors and distributes them across the different functions associated with that PF. So far the PF driver has been doing this with a hard-coded policy implemented within the driver. We are exposing that policy via devlink and allowing a host admin to split the MSI-X vectors that are assigned to a PF by the FW across its different functions (PF, all its VFs, all its SFs and RDMA) based on the use cases/workload running on the host.
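As a rough sketch of what that exposed policy could look like through the resource API (resource names, sizes, and the exact output format are hypothetical, extrapolated from the cover letter's msix/msix_eth example):

$ devlink resource show pci/0000:31:00.0
pci/0000:31:00.0:
  name msix size 1024 unit entry
    resources:
      name msix_misc size 4 unit entry
      name msix_eth size 92 size_new 16 unit entry
      name msix_vf size 864 unit entry
      name msix_rdma size 64 unit entry
$ sudo devlink dev reload pci/0000:31:00.0    # size_new values take effect on reload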