Message ID: 20221109184244.7032-1-ajit.khaparde@broadcom.com
Series: Add Auxiliary driver support
On Wed, Nov 09, 2022 at 10:42:38AM -0800, Ajit Khaparde wrote:
> Add auxiliary device driver for Broadcom devices.
> The bnxt_en driver will register and initialize an aux device
> if RDMA is enabled in the underlying device.
> The bnxt_re driver will then probe and initialize the
> RoCE interfaces with the infiniband stack.
>
> We got rid of the bnxt_en_ops which the bnxt_re driver used to
> communicate with bnxt_en.
> Similarly, we have tried to clean up most of the bnxt_ulp_ops.
> In most of the cases we used the functions and entry points provided
> by the auxiliary bus driver framework.
> And now these are the minimal functions needed to support the functionality.
>
> We will try to work on getting rid of the remaining ones if we find any
> other viable option in future.

I still see extra checks for something that was already checked in upper
functions; for example, in bnxt_re_register_netdev() you check rdev, which
you already checked before.

However, the most important part is the continued existence of bnxt_ulp_ops,
which shows a complete no-go: SR-IOV config in RDMA code.

 302 static struct bnxt_ulp_ops bnxt_re_ulp_ops = {
 303 	.ulp_sriov_config = bnxt_re_sriov_config,
 304 	.ulp_irq_stop = bnxt_re_stop_irq,
 305 	.ulp_irq_restart = bnxt_re_start_irq
 306 };

All PCI management logic and interfaces need to be inside the eth part
of your driver, and only that part should implement SR-IOV config. Once
the user enables SR-IOV, the PCI driver should create auxiliary devices
for each VF. These devices will have RDMA capabilities, and that will
trigger the RDMA driver to bind to them.

Thanks
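[Editorial note] The flow Leon describes above can be sketched in kernel-style pseudocode: the eth driver's sriov_configure hook owns SR-IOV enablement and spawns one auxiliary device per VF, and an aux device that advertises RDMA capability is matched against the RDMA driver's id_table by the auxiliary bus. This is not compilable code and not the actual bnxt_en implementation; bnxt_create_roce_aux_dev() and bnxt_destroy_roce_aux_devs() are hypothetical helper names used only for illustration.

```c
/* Sketch only; helper names are hypothetical, not the real bnxt_en code. */
static int bnxt_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
	struct bnxt *bp = pci_get_drvdata(pdev);
	int rc, i;

	if (num_vfs == 0) {
		/* Tear down the per-VF aux devices, then disable SR-IOV. */
		bnxt_destroy_roce_aux_devs(bp);		/* hypothetical */
		pci_disable_sriov(pdev);
		return 0;
	}

	rc = pci_enable_sriov(pdev, num_vfs);
	if (rc)
		return rc;

	/*
	 * One auxiliary device per VF. A device advertising RDMA
	 * capability is matched by the auxiliary bus against the RDMA
	 * driver's id_table, which triggers its .probe for that VF.
	 */
	for (i = 0; i < num_vfs; i++)
		bnxt_create_roce_aux_dev(bp, i);	/* hypothetical */

	return num_vfs;
}
```

Note that pci_enable_sriov() makes the VFs visible on the PCI bus immediately, so the ordering between SR-IOV enablement and aux-device creation matters; that point comes up later in the thread.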
Leon,
We appreciate your valuable feedback.
Please see inline.

On Thu, Nov 10, 2022 at 2:53 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Nov 09, 2022 at 10:42:38AM -0800, Ajit Khaparde wrote:
> > Add auxiliary device driver for Broadcom devices.
> > The bnxt_en driver will register and initialize an aux device
> > if RDMA is enabled in the underlying device.
> > The bnxt_re driver will then probe and initialize the
> > RoCE interfaces with the infiniband stack.
> >
> > We got rid of the bnxt_en_ops which the bnxt_re driver used to
> > communicate with bnxt_en.
> > Similarly, we have tried to clean up most of the bnxt_ulp_ops.
> > In most of the cases we used the functions and entry points provided
> > by the auxiliary bus driver framework.
> > And now these are the minimal functions needed to support the functionality.
> >
> > We will try to work on getting rid of the remaining ones if we find any
> > other viable option in future.
>
> I still see extra checks for something that was already checked in upper
> functions; for example, in bnxt_re_register_netdev() you check rdev, which
> you already checked before.
Sure. I will do another sweep and clean up.
>
> However, the most important part is the continued existence of bnxt_ulp_ops,
> which shows a complete no-go: SR-IOV config in RDMA code.
>
>  302 static struct bnxt_ulp_ops bnxt_re_ulp_ops = {
>  303 	.ulp_sriov_config = bnxt_re_sriov_config,
>  304 	.ulp_irq_stop = bnxt_re_stop_irq,
>  305 	.ulp_irq_restart = bnxt_re_start_irq
>  306 };
>
> All PCI management logic and interfaces need to be inside the eth part
> of your driver, and only that part should implement SR-IOV config. Once
> the user enables SR-IOV, the PCI driver should create auxiliary devices
> for each VF. These devices will have RDMA capabilities, and that will
> trigger the RDMA driver to bind to them.
I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
design is that the RoCE driver is responsible for making adjustments to
the RoCE resources.

So once the VFs are created and the bnxt_en driver enables SR-IOV and
adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
driver for the same purpose.

1. We could do something like this to the auxiliary_driver structure:

diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
index de21d9d24a95..4e581fbf458f 100644
--- a/include/linux/auxiliary_bus.h
+++ b/include/linux/auxiliary_bus.h
@@ -148,6 +148,7 @@ struct auxiliary_device {
  * @shutdown: Called at shut-down time to quiesce the device.
  * @suspend: Called to put the device to sleep mode. Usually to a power state.
  * @resume: Called to bring a device from sleep mode.
+ * @sriov_configure: Called to allow configuration of VFs .
  * @name: Driver name.
  * @driver: Core driver structure.
  * @id_table: Table of devices this driver should match on the bus.
@@ -183,6 +184,7 @@ struct auxiliary_driver {
 	void (*shutdown)(struct auxiliary_device *auxdev);
 	int (*suspend)(struct auxiliary_device *auxdev, pm_message_t state);
 	int (*resume)(struct auxiliary_device *auxdev);
+	int (*sriov_configure)(struct auxiliary_device *auxdev, int num_vfs); /* On PF */
 	const char *name;
 	struct device_driver driver;
 	const struct auxiliary_device_id *id_table;

Then the bnxt_en driver could call into bnxt_re via that interface.
Please let me know what you think.

2. While it may take care of the first function in the ulp_ops, it
leaves us with two. And that is where I will need some input on whether
we absolutely need to get rid of the ops.

2a. We may be able to use the auxiliary_device suspend & resume with a
private flag in the driver's aux_dev pointer.
2b. Or, just like (1) above, add another hook to auxiliary_driver:
	void (*restart)(struct auxiliary_device *auxdev);
and then use the auxiliary_driver shutdown & restart with a private flag.

Note that we may get creative right now and get rid of the ulp_ops.
But the bnxt_en driver having a need to update the bnxt_re driver is
part of the design. So it will help if we can consider beyond
ulp_irq_stop and ulp_irq_restart.
2c. Maybe keep the bnxt_ulp_ops for that reason?

Thank you for your time.

Thanks
Ajit

>
> Thanks
On Mon, Nov 14, 2022 at 04:47:31PM -0800, Ajit Khaparde wrote:
> Leon,
> We appreciate your valuable feedback.
> Please see inline.
>
> On Thu, Nov 10, 2022 at 2:53 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Nov 09, 2022 at 10:42:38AM -0800, Ajit Khaparde wrote:
> > > Add auxiliary device driver for Broadcom devices.
> > > The bnxt_en driver will register and initialize an aux device
> > > if RDMA is enabled in the underlying device.
> > > The bnxt_re driver will then probe and initialize the
> > > RoCE interfaces with the infiniband stack.
> > >
> > > We got rid of the bnxt_en_ops which the bnxt_re driver used to
> > > communicate with bnxt_en.
> > > Similarly, we have tried to clean up most of the bnxt_ulp_ops.
> > > In most of the cases we used the functions and entry points provided
> > > by the auxiliary bus driver framework.
> > > And now these are the minimal functions needed to support the functionality.
> > >
> > > We will try to work on getting rid of the remaining ones if we find any
> > > other viable option in future.
> >
> > I still see extra checks for something that was already checked in upper
> > functions; for example, in bnxt_re_register_netdev() you check rdev, which
> > you already checked before.
> Sure. I will do another sweep and clean up.
> >
> > However, the most important part is the continued existence of bnxt_ulp_ops,
> > which shows a complete no-go: SR-IOV config in RDMA code.
> >
> >  302 static struct bnxt_ulp_ops bnxt_re_ulp_ops = {
> >  303 	.ulp_sriov_config = bnxt_re_sriov_config,
> >  304 	.ulp_irq_stop = bnxt_re_stop_irq,
> >  305 	.ulp_irq_restart = bnxt_re_start_irq
> >  306 };
> >
> > All PCI management logic and interfaces need to be inside the eth part
> > of your driver, and only that part should implement SR-IOV config. Once
> > the user enables SR-IOV, the PCI driver should create auxiliary devices
> > for each VF. These devices will have RDMA capabilities, and that will
> > trigger the RDMA driver to bind to them.
> I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
> VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
> design is that the RoCE driver is responsible for making adjustments to
> the RoCE resources.

You can still do these adjustments by checking the type of function that
called into the RDMA .probe. The PCI core exposes some functions to help
distinguish between PFs and VFs.

>
> So once the VFs are created and the bnxt_en driver enables SR-IOV and
> adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
> driver for the same purpose.

If I read the code correctly, all these resources are for one PCI function.

Something like this:

bnxt_re_probe()
{
	...
	if (is_virtfn(p))
		bnxt_re_sriov_config(p);
	...
}

>
> 1. We could do something like this to the auxiliary_driver structure:
>
> diff --git a/include/linux/auxiliary_bus.h b/include/linux/auxiliary_bus.h
> index de21d9d24a95..4e581fbf458f 100644
> --- a/include/linux/auxiliary_bus.h
> +++ b/include/linux/auxiliary_bus.h
> @@ -148,6 +148,7 @@ struct auxiliary_device {
>   * @shutdown: Called at shut-down time to quiesce the device.
>   * @suspend: Called to put the device to sleep mode. Usually to a power state.
>   * @resume: Called to bring a device from sleep mode.
> + * @sriov_configure: Called to allow configuration of VFs .
>   * @name: Driver name.
>   * @driver: Core driver structure.
>   * @id_table: Table of devices this driver should match on the bus.
> @@ -183,6 +184,7 @@ struct auxiliary_driver {
>  	void (*shutdown)(struct auxiliary_device *auxdev);
>  	int (*suspend)(struct auxiliary_device *auxdev, pm_message_t state);
>  	int (*resume)(struct auxiliary_device *auxdev);
> +	int (*sriov_configure)(struct auxiliary_device *auxdev, int num_vfs); /* On PF */
>  	const char *name;
>  	struct device_driver driver;
>  	const struct auxiliary_device_id *id_table;
>
> Then the bnxt_en driver could call into bnxt_re via that interface.

The @sriov_configure callback is PCI specific and doesn't belong to aux
devices.

> Please let me know what you think.
>
> 2. While it may take care of the first function in the ulp_ops, it
> leaves us with two. And that is where I will need some input on whether
> we absolutely need to get rid of the ops.
>
> 2a. We may be able to use the auxiliary_device suspend & resume with a
> private flag in the driver's aux_dev pointer.
> 2b. Or, just like (1) above, add another hook to auxiliary_driver:
> 	void (*restart)(struct auxiliary_device *auxdev);
> and then use the auxiliary_driver shutdown & restart with a private flag.
>
> Note that we may get creative right now and get rid of the ulp_ops.
> But the bnxt_en driver having a need to update the bnxt_re driver is
> part of the design. So it will help if we can consider beyond
> ulp_irq_stop and ulp_irq_restart.
> 2c. Maybe keep the bnxt_ulp_ops for that reason?

It would be nicer to get rid of bnxt_ulp_ops completely, but it is not a
must. Getting rid of sriov_configure is the most important comment here.

Thanks

> Thank you for your time.
>
> Thanks
> Ajit
>
> >
> > Thanks
On Wed, Nov 16, 2022 at 5:22 AM Leon Romanovsky <leon@kernel.org> wrote:
>
::snip::
> > > All PCI management logic and interfaces need to be inside the eth part
> > > of your driver, and only that part should implement SR-IOV config. Once
> > > the user enables SR-IOV, the PCI driver should create auxiliary devices
> > > for each VF. These devices will have RDMA capabilities, and that will
> > > trigger the RDMA driver to bind to them.
> > I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
> > VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
> > design is that the RoCE driver is responsible for making adjustments to
> > the RoCE resources.
>
> You can still do these adjustments by checking the type of function that
> called into the RDMA .probe. The PCI core exposes some functions to help
> distinguish between PFs and VFs.
>
> >
> > So once the VFs are created and the bnxt_en driver enables SR-IOV and
> > adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
> > driver for the same purpose.
>
> If I read the code correctly, all these resources are for one PCI function.
>
> Something like this:
>
> bnxt_re_probe()
> {
> 	...
> 	if (is_virtfn(p))
> 		bnxt_re_sriov_config(p);
> 	...
> }
I understand what you are suggesting.
But what I want is a way to do this in the context of the PF,
preferably before the VFs are probed. So we are trying to call
bnxt_re_sriov_config in the context of handling the PF's
sriov_configure implementation. Having the ulp_ops allows us to
avoid resource wastage and assumptions in the bnxt_re driver.

::snip::
On Tue, Nov 22, 2022 at 07:02:45AM -0800, Ajit Khaparde wrote:
> On Wed, Nov 16, 2022 at 5:22 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> ::snip::
> > > > All PCI management logic and interfaces need to be inside the eth part
> > > > of your driver, and only that part should implement SR-IOV config. Once
> > > > the user enables SR-IOV, the PCI driver should create auxiliary devices
> > > > for each VF. These devices will have RDMA capabilities, and that will
> > > > trigger the RDMA driver to bind to them.
> > > I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
> > > VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
> > > design is that the RoCE driver is responsible for making adjustments to
> > > the RoCE resources.
> >
> > You can still do these adjustments by checking the type of function that
> > called into the RDMA .probe. The PCI core exposes some functions to help
> > distinguish between PFs and VFs.
> >
> > >
> > > So once the VFs are created and the bnxt_en driver enables SR-IOV and
> > > adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
> > > driver for the same purpose.
> >
> > If I read the code correctly, all these resources are for one PCI function.
> >
> > Something like this:
> >
> > bnxt_re_probe()
> > {
> > 	...
> > 	if (is_virtfn(p))
> > 		bnxt_re_sriov_config(p);
> > 	...
> > }
> I understand what you are suggesting.
> But what I want is a way to do this in the context of the PF,
> preferably before the VFs are probed.

I don't understand the last sentence. You call this sriov_config in the
bnxt_re driver without any protection from VFs being probed.

> So we are trying to call
> bnxt_re_sriov_config in the context of handling the PF's
> sriov_configure implementation. Having the ulp_ops allows us to
> avoid resource wastage and assumptions in the bnxt_re driver.

To which resource wastage are you referring?

There are no differences whether the same limits are set in the bnxt_en
driver when the RDMA bnxt device is created, or in bnxt_re, which will be
called once the RDMA device is created.

Thanks

>
> ::snip::
On Tue, Nov 22, 2022 at 10:59 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Nov 22, 2022 at 07:02:45AM -0800, Ajit Khaparde wrote:
> > On Wed, Nov 16, 2022 at 5:22 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > ::snip::
> > > > > All PCI management logic and interfaces need to be inside the eth part
> > > > > of your driver, and only that part should implement SR-IOV config. Once
> > > > > the user enables SR-IOV, the PCI driver should create auxiliary devices
> > > > > for each VF. These devices will have RDMA capabilities, and that will
> > > > > trigger the RDMA driver to bind to them.
> > > > I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
> > > > VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
> > > > design is that the RoCE driver is responsible for making adjustments to
> > > > the RoCE resources.
> > >
> > > You can still do these adjustments by checking the type of function that
> > > called into the RDMA .probe. The PCI core exposes some functions to help
> > > distinguish between PFs and VFs.
> > >
> > > >
> > > > So once the VFs are created and the bnxt_en driver enables SR-IOV and
> > > > adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
> > > > driver for the same purpose.
> > >
> > > If I read the code correctly, all these resources are for one PCI function.
> > >
> > > Something like this:
> > >
> > > bnxt_re_probe()
> > > {
> > > 	...
> > > 	if (is_virtfn(p))
> > > 		bnxt_re_sriov_config(p);
> > > 	...
> > > }
> > I understand what you are suggesting.
> > But what I want is a way to do this in the context of the PF,
> > preferably before the VFs are probed.
>
> I don't understand the last sentence. You call this sriov_config in the
> bnxt_re driver without any protection from VFs being probed.

Let me elaborate -
When a user sets num_vfs to a non-zero number, the PCI driver hook
sriov_configure calls bnxt_sriov_configure(). Once pci_enable_sriov()
succeeds, bnxt_ulp_sriov_cfg() is issued under bnxt_sriov_configure().
All this happens under bnxt_en.
bnxt_ulp_sriov_cfg() ultimately calls into the bnxt_re driver.
Since bnxt_sriov_configure() is called only for PFs, bnxt_ulp_sriov_cfg()
is called for PFs only.

Once bnxt_ulp_sriov_cfg() calls into bnxt_re via the ulp_ops,
it adjusts the QPs, SRQs, CQs, MRs, GIDs and such.

>
> > So we are trying to call
> > bnxt_re_sriov_config in the context of handling the PF's
> > sriov_configure implementation. Having the ulp_ops allows us to
> > avoid resource wastage and assumptions in the bnxt_re driver.
>
> To which resource wastage are you referring?
Essentially, the PF driver reserves a set of the above resources for the PF
and divides the remaining resources among the VFs.
If the calculation is based on sriov_totalvfs instead of sriov_numvfs,
there can be a difference in the resources provisioned for a VF.
That is because a user may create a subset of VFs instead of the
total VFs allowed in the PCI SR-IOV capability register.
I was referring to the resource wastage in that deployment scenario.

Thanks
Ajit

>
> There are no differences whether the same limits are set in the bnxt_en
> driver when the RDMA bnxt device is created, or in bnxt_re, which will be
> called once the RDMA device is created.
>
> Thanks
>
> >
> > ::snip::
> >
On Mon, Nov 28, 2022 at 06:01:13PM -0800, Ajit Khaparde wrote:
> On Tue, Nov 22, 2022 at 10:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Tue, Nov 22, 2022 at 07:02:45AM -0800, Ajit Khaparde wrote:
> > > On Wed, Nov 16, 2022 at 5:22 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > ::snip::
> > > > > > All PCI management logic and interfaces need to be inside the eth part
> > > > > > of your driver, and only that part should implement SR-IOV config. Once
> > > > > > the user enables SR-IOV, the PCI driver should create auxiliary devices
> > > > > > for each VF. These devices will have RDMA capabilities, and that will
> > > > > > trigger the RDMA driver to bind to them.
> > > > > I agree, and once the PF creates the auxiliary devices for the VFs, the RoCE
> > > > > VFs indeed get probed and created. But the twist in the bnxt_en/bnxt_re
> > > > > design is that the RoCE driver is responsible for making adjustments to
> > > > > the RoCE resources.
> > > >
> > > > You can still do these adjustments by checking the type of function that
> > > > called into the RDMA .probe. The PCI core exposes some functions to help
> > > > distinguish between PFs and VFs.
> > > >
> > > > >
> > > > > So once the VFs are created and the bnxt_en driver enables SR-IOV and
> > > > > adjusts the NIC resources for the VFs, it tries to call into the bnxt_re
> > > > > driver for the same purpose.
> > > >
> > > > If I read the code correctly, all these resources are for one PCI function.
> > > >
> > > > Something like this:
> > > >
> > > > bnxt_re_probe()
> > > > {
> > > > 	...
> > > > 	if (is_virtfn(p))
> > > > 		bnxt_re_sriov_config(p);
> > > > 	...
> > > > }
> > > I understand what you are suggesting.
> > > But what I want is a way to do this in the context of the PF,
> > > preferably before the VFs are probed.
> >
> > I don't understand the last sentence. You call this sriov_config in the
> > bnxt_re driver without any protection from VFs being probed.
>
> Let me elaborate -
> When a user sets num_vfs to a non-zero number, the PCI driver hook
> sriov_configure calls bnxt_sriov_configure(). Once pci_enable_sriov()
> succeeds, bnxt_ulp_sriov_cfg() is issued under bnxt_sriov_configure().
> All this happens under bnxt_en.
> bnxt_ulp_sriov_cfg() ultimately calls into the bnxt_re driver.
> Since bnxt_sriov_configure() is called only for PFs, bnxt_ulp_sriov_cfg()
> is called for PFs only.
>
> Once bnxt_ulp_sriov_cfg() calls into bnxt_re via the ulp_ops,
> it adjusts the QPs, SRQs, CQs, MRs, GIDs and such.

Once you call pci_enable_sriov(), the PCI core creates sysfs entries,
which triggers udev rules and VF probing. Because you are calling it in
bnxt_sriov_configure(), you will have inherent protection for the PF via
the PCI lock, but not for the VFs.

>
> > > So we are trying to call
> > > bnxt_re_sriov_config in the context of handling the PF's
> > > sriov_configure implementation. Having the ulp_ops allows us to
> > > avoid resource wastage and assumptions in the bnxt_re driver.
> >
> > To which resource wastage are you referring?
> Essentially, the PF driver reserves a set of the above resources for the PF
> and divides the remaining resources among the VFs.
> If the calculation is based on sriov_totalvfs instead of sriov_numvfs,
> there can be a difference in the resources provisioned for a VF.
> That is because a user may create a subset of VFs instead of the
> total VFs allowed in the PCI SR-IOV capability register.
> I was referring to the resource wastage in that deployment scenario.

It is OK; set all the needed limits in bnxt_en. You don't need to call
into bnxt_re for that.

>
> Thanks
> Ajit
>
> >
> > There are no differences whether the same limits are set in the bnxt_en
> > driver when the RDMA bnxt device is created, or in bnxt_re, which will be
> > called once the RDMA device is created.
> >
> > Thanks
> >
> > >
> > > ::snip::
> > >
On Tue, Nov 29, 2022 at 1:13 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Mon, Nov 28, 2022 at 06:01:13PM -0800, Ajit Khaparde wrote:
> > On Tue, Nov 22, 2022 at 10:59 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
::: snip :::
> > >
> > > > So we are trying to call
> > > > bnxt_re_sriov_config in the context of handling the PF's
> > > > sriov_configure implementation. Having the ulp_ops allows us to
> > > > avoid resource wastage and assumptions in the bnxt_re driver.
> > >
> > > To which resource wastage are you referring?
> > Essentially, the PF driver reserves a set of the above resources for the PF
> > and divides the remaining resources among the VFs.
> > If the calculation is based on sriov_totalvfs instead of sriov_numvfs,
> > there can be a difference in the resources provisioned for a VF.
> > That is because a user may create a subset of VFs instead of the
> > total VFs allowed in the PCI SR-IOV capability register.
> > I was referring to the resource wastage in that deployment scenario.
>
> It is OK; set all the needed limits in bnxt_en. You don't need to call
> into bnxt_re for that.

Thanks. We have removed the sriov_config callback.
I will send the fresh patchset in a few minutes.

>
> > Thanks
> > Ajit
> >
> > >
> > > There are no differences whether the same limits are set in the bnxt_en
> > > driver when the RDMA bnxt device is created, or in bnxt_re, which will be
> > > called once the RDMA device is created.
> > >
> > > Thanks
> > >
> > > >
> > > > ::snip::
> > > >