Message ID | 20220421071747.1892-11-chengyou@linux.alibaba.com (mailing list archive)
---|---
State | Superseded
Delegated to | Jason Gunthorpe
Series | Elastic RDMA Adapter (ERDMA) driver
On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:

> +static struct rdma_link_ops erdma_link_ops = {
> +	.type = "erdma",
> +	.newlink = erdma_newlink,
> +};

Why is there still a newlink?

Jason
On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
>
> Why is there still a newlink?
>
> Jason

Yeah, I remember your suggestion that the ibdev should keep the same
lifecycle with its PCI device, and so does it now.

The initialization flow for erdma now:
probe:
 - Hardware specified initialization
 - IB device initialization
 - Calling ib_register_device with ibdev->netdev == NULL

After probe, The ib device has been registered, but we left it in
invalid state.
To fully complete the initialization, we should set the ibdev->netdev.

And the newlink command in erdma only do one thing now: set the
ibdev->netdev if the rule matches, and it is uncorrelated with the
ib device lifecycle.

I think this may be the best way to link the netdev to ibdev for erdma.
Using netdev notifier is not good because we should register a netdev
notifier for the module.

If this way is not proper or there are some better ways, please let me
know.

Thanks,
Cheng Xu
On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>
> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> >
> > > +static struct rdma_link_ops erdma_link_ops = {
> > > +	.type = "erdma",
> > > +	.newlink = erdma_newlink,
> > > +};
> >
> > Why is there still a newlink?
> >
> > Jason
>
> Yeah, I remember your suggestion that the ibdev should keep the same
> lifecycle with its PCI device, and so does it now.
>
> The initialization flow for erdma now:
> probe:
>  - Hardware specified initialization
>  - IB device initialization
>  - Calling ib_register_device with ibdev->netdev == NULL
>
> After probe, The ib device has been registered, but we left it in
> invalid state.
> To fully complete the initialization, we should set the ibdev->netdev.
>
> And the newlink command in erdma only do one thing now: set the
> ibdev->netdev if the rule matches, and it is uncorrelated with the
> ib device lifecycle.

This is not what newlink is for, it can't be used like this

Jason
On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>
<...>
>> And the newlink command in erdma only do one thing now: set the
>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>> ib device lifecycle.
>
> This is not what newlink is for, it can't be used like this
>
> Jason

[A typo in previous reply, so I re-edit it]

Yeah, the newlink is only used by siw and rxe to control their ib
device lifecycle now.

But for the usage of newlink, it is triggered by the 'rdma link add'
command. In my opinion, the 'rdma link' command set is all about display
(rdma link show) or management of the link relationship between ib
device and net device (rdma link add/del). And what erdma needs is
setting up the link between ib device and net device.

So, though the current usage is controlling the lifecycle of siw/rxe,
for the command's naming and connotation, should it be OK to expand the
usage of newlink to meet our requirement?

Or, Is it ok that erdma registers a net notifier in our module to handle
this?

Thanks,
Cheng Xu
On 5/16/22 8:40 PM, Cheng Xu wrote:
>
> On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
>> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>
<...>
>>> And the newlink command in erdma only do one thing now: set the
>>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>>> ib device lifecycle.
>>
>> This is not what newlink is for, it can't be used like this
>>
>> Jason
>
> [A typo in previous reply, so I re-edit it]
>
<...>
> Or, Is it ok that erdma registers a net notifier in our module to handle
> this?
>

I think this twice, use a net notifier of module can not solve this, due
to the leak of all ib devices information in erdma module.

The solution may be simple: register net notifier per device in probe,
and call ib_device_set_netdev if matches in the notifier callback.

Should this be the right way?

Thanks,
Cheng Xu
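For readers following the thread, a minimal sketch of the per-device
notifier idea proposed above (the one Jason pushes back on below). The
function names are hypothetical; the netdev_nb, netdev, ibdev and
attrs.peer_addr fields are taken from the posted patch further down, and
register_netdevice_notifier(), netdev_notifier_info_to_dev(),
ether_addr_equal_unaligned() and ib_device_set_netdev() are existing
kernel APIs. This is an illustration of the proposal, not the driver's
actual code.

/*
 * Sketch only: bind the ib_device to the net_device that shares its MAC
 * address, from a per-device netdev notifier registered at probe time.
 */
static int erdma_netdev_match_event(struct notifier_block *nb,
				    unsigned long event, void *arg)
{
	struct net_device *netdev = netdev_notifier_info_to_dev(arg);
	struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb);

	/* Only react to new interfaces, and only if we are not bound yet. */
	if (event != NETDEV_REGISTER || dev->netdev)
		return NOTIFY_DONE;

	/* "The rule matches" == the permanent MAC equals the device's MAC. */
	if (!ether_addr_equal_unaligned(netdev->perm_addr,
					dev->attrs.peer_addr))
		return NOTIFY_DONE;

	dev->netdev = netdev;
	ib_device_set_netdev(&dev->ibdev, netdev, 1);

	return NOTIFY_OK;
}

static int erdma_bind_netdev(struct erdma_dev *dev)
{
	dev->netdev_nb.notifier_call = erdma_netdev_match_event;
	return register_netdevice_notifier(&dev->netdev_nb);
}

Note that register_netdevice_notifier() replays NETDEV_REGISTER (and
NETDEV_UP) for interfaces that already exist, so such a callback would
also cover the case where the virtio-net device was registered before
erdma probed.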
On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
>
> On 5/16/22 8:40 PM, Cheng Xu wrote:
>
<...>
> > Or, Is it ok that erdma registers a net notifier in our module to handle
> > this?
> >
>
> I think this twice, use a net notifier of module can not solve this, due
> to the leak of all ib devices information in erdma module.

I don't understand what this means

> The solution may be simple: register net notifier per device in probe,
> and call ib_device_set_netdev if matches in the notifier callback.

That sounds wrong

Jason
On 5/16/22 10:07 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
>>
<...>
>>
>> I think this twice, use a net notifier of module can not solve this, due
>> to the leak of all ib devices information in erdma module.
>
> I don't understand what this means

At first I want to register a global net notifier in our module, and link
the ib device and net device if matches in the callback. But in the
callback, we can not get the ib devices registered by erdma, because our
driver module does not maintain the ib device list. This is not the
right way.

>> The solution may be simple: register net notifier per device in probe,
>> and call ib_device_set_netdev if matches in the notifier callback.
>
> That sounds wrong
>

OK. So using rdma link command to handle the link relationship manually
is still a better choice?

Actually, erdma looks like a hardware implementation of siw/rxe (more
likes siw, because erdma uses iWarp, and have similar CM
implementation): the ethernet functionality is provided by existed net
device, no matter what kind of the net device (due to virtio-net is
de-facto in cloud virtualization, erdma is binded to virtio-net in
practice). Due to this similarity, siw/rxe use 'rdma link' to link the
ib device and net device (including create ib device), Does it sound
reasonable that erdma use it to link the net deivce?

Could you re-consider this way? Or could you give us some suggestion? I
think this is the main and important issue for our upstreaming. I would
be very grateful.

Thanks,
Cheng Xu

> Jason
On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
> > > I think this twice, use a net notifier of module can not solve this, due
> > > to the leak of all ib devices information in erdma module.
> >
> > I don't understand what this means
> At first I want to register a global net notifier in our module, and link
> the ib device and net device if matches in the callback. But in the
> callback, we can not get the ib devices registered by erdma, because our
> driver module does not maintain the ib device list. This is not the
> right way.

The core has the device list, if you need to search it then you need
to add a new searching function

> > > The solution may be simple: register net notifier per device in probe,
> > > and call ib_device_set_netdev if matches in the notifier callback.
> >
> > That sounds wrong
> >
>
> OK. So using rdma link command to handle the link relationship manually
> is still a better choice?

No

> Actually, erdma looks like a hardware implementation of siw/rxe (more
> likes siw, because erdma uses iWarp, and have similar CM
> implementation): the ethernet functionality is provided by existed net
> device, no matter what kind of the net device (due to virtio-net is
> de-facto in cloud virtualization, erdma is binded to virtio-net in
> practice). Due to this similarity, siw/rxe use 'rdma link' to link the
> ib device and net device (including create ib device), Does it sound
> reasonable that erdma use it to link the net deivce?

It is not like siw because it cannot run on any netdevice, it can only
run on the single netdevice the HW is linked to in the background.

Jason
On 5/17/22 1:31 AM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
<...>
>> At first I want to register a global net notifier in our module, and link
>> the ib device and net device if matches in the callback. But in the
>> callback, we can not get the ib devices registered by erdma, because our
>> driver module does not maintain the ib device list. This is not the
>> right way.
>
> The core has the device list, if you need to search it then you need
> to add a new searching function
>

I had a mistake: instead of what I said, erdma should search net devices
in a wq to find the matched net device, not search ib devices in a net
notifier callback. I think it should be fine.

<...>
>> Actually, erdma looks like a hardware implementation of siw/rxe (more
>> likes siw, because erdma uses iWarp, and have similar CM
>> implementation): the ethernet functionality is provided by existed net
>> device, no matter what kind of the net device (due to virtio-net is
>> de-facto in cloud virtualization, erdma is binded to virtio-net in
>> practice). Due to this similarity, siw/rxe use 'rdma link' to link the
>> ib device and net device (including create ib device), Does it sound
>> reasonable that erdma use it to link the net deivce?
>
> It is not like siw because it cannot run on any netdevice, it can only
> run on the single netdevice the HW is linked to in the background.
>

Yeah, you are exactly right, I omitted this detail in the last reply.

Thanks,
Cheng Xu

> Jason
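A rough sketch of the "search the net devices from a wq" variant
mentioned above, assuming a hypothetical netdev_work member in struct
erdma_dev and matching purely by permanent MAC address; for_each_netdev(),
rtnl_lock() and ib_device_set_netdev() are the real kernel interfaces
assumed here.

/* Sketch: deferred scan over existing netdevs, bind on a MAC match. */
static void erdma_netdev_search_work(struct work_struct *work)
{
	struct erdma_dev *dev = container_of(work, struct erdma_dev,
					     netdev_work);
	struct net_device *netdev;

	rtnl_lock();
	for_each_netdev(&init_net, netdev) {
		/* The bound netdev is identified purely by its MAC address. */
		if (ether_addr_equal_unaligned(netdev->perm_addr,
					       dev->attrs.peer_addr)) {
			dev->netdev = netdev;
			ib_device_set_netdev(&dev->ibdev, netdev, 1);
			break;
		}
	}
	rtnl_unlock();
}

The obvious gap of a one-shot scan, compared with the notifier sketch
earlier, is that a net device registered after the scan (for example a
hot-plugged virtio-net) would never be picked up.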
On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
>
> Why is there still a newlink?
>

Hello, Jason,

About this issue, I have another idea, more simple and reasonable.

Maybe erdma driver doesn't need to link to a net device in kernel. In
the core code, the ib_device_get_netdev has several use cases:

1). query port info in netlink
2). get eth speed for IB (ib_get_eth_speed)
3). enumerate all RoCE ports (ib_enum_roce_netdev)
4). iw_query_port

The cases related to erdma is 4). But we change it in our patch 02/12.
So, it seems all right that we do not link erdma to a net device.

* I also test this solution, it works for both perftest and NoF. *

Another issue is how to get the port state and attributes without
net device. For this, erdma can get it from HW directly.

So, I think this may be the final solution. (BTW, I have gone over
the rdma drivers, EFA does in this way, it also has two separated
devices for net and rdma. It inspired me).

Thanks,
Cheng Xu
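To make the "get the port state and attributes from HW directly" point
concrete, a hedged sketch of a query_port callback that never touches a
netdev. This is not the erdma_query_port from the posted patch: dev->state
and to_edev() do appear in the patch below, but ERDMA_DEFAULT_MTU and the
speed/width values are assumptions standing in for whatever the hardware
actually reports.

/* Sketch: fill ib_port_attr from device-owned state, no netdev involved. */
static int erdma_query_port_from_hw(struct ib_device *ibdev, u32 port,
				    struct ib_port_attr *attr)
{
	struct erdma_dev *dev = to_edev(ibdev);

	memset(attr, 0, sizeof(*attr));

	attr->state = dev->state;          /* tracked from HW, not a netdev */
	attr->active_speed = IB_SPEED_EDR; /* assumed: reported by HW */
	attr->active_width = IB_WIDTH_4X;  /* assumed: reported by HW */
	attr->max_mtu = ib_mtu_int_to_enum(ERDMA_DEFAULT_MTU); /* assumed */
	attr->active_mtu = attr->max_mtu;
	attr->gid_tbl_len = 1;             /* single MAC-derived GID entry */
	attr->max_msg_sz = -1;
	attr->phys_state = dev->state == IB_PORT_ACTIVE ?
				   IB_PORT_PHYS_STATE_LINK_UP :
				   IB_PORT_PHYS_STATE_DISABLED;

	return 0;
}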
On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>
<...>
> Maybe erdma driver doesn't need to link to a net device in kernel. In
> the core code, the ib_device_get_netdev has several use cases:
>
> 1). query port info in netlink
> 2). get eth speed for IB (ib_get_eth_speed)
> 3). enumerate all RoCE ports (ib_enum_roce_netdev)
> 4). iw_query_port
>
> The cases related to erdma is 4). But we change it in our patch 02/12.
> So, it seems all right that we do not link erdma to a net device.
>
> * I also test this solution, it works for both perftest and NoF. *
>
> Another issue is how to get the port state and attributes without
> net device. For this, erdma can get it from HW directly.
>
> So, I think this may be the final solution. (BTW, I have gone over
> the rdma drivers, EFA does in this way, it also has two separated
> devices for net and rdma. It inspired me).

I'm not sure this works for an iWarp device - various things expect to
know the netdevice to know how to relate IP addresses to the iWarp
stuff - but then I don't really know iWarp.

If it works at all it is not a great idea. EFA gets by because it
doesn't use IP addressing at all.

Jason
On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>>
<...>
>> So, I think this may be the final solution. (BTW, I have gone over
>> the rdma drivers, EFA does in this way, it also has two separated
>> devices for net and rdma. It inspired me).
>
> I'm not sure this works for an iWarp device - various things expect to
> know the netdevice to know how to relate IP addresses to the iWarp
> stuff - but then I don't really know iWarp.

As far as I know, iWarp device only has one GID entry which generated
from MAC address.

For iWarp, The CM part in core code resolves address, finds
route with the help of kernel's net subsystem, and then obtains the
correct ibdev by GID matching. The GID matching in iWarp is indeed MAC
address matching.

In another words, for iWarp devices, the core code doesn't handle IP
addressing related stuff directly, it is finished by calling net APIs.
The netdev set by ib_device_set_netdev does not used in iWarp's CM
process.

The binded netdev in iWarp devices, mainly have two purposes:
1). generated GID0, using the netdev's mac address.
2). get the port state and attributes.

For 1), erdma device binded to net device also by mac address, which can
be obtained from our PCIe bar registers.
For 2), erdma can also get the information, and may be more accurately.
For example, erdma can have different MTU with virtio-net in our cloud.

For RoCEv2, I know that it has many GIDs, some of them are generated
from IP addresses, and handing IP addressing in core code.

> If it works at all it is not a great idea

As I explained above (I hope I explained clearly, but I'm a little worry
about my English :) ), netdev binded by ib_device_set_netdev in iWarp
device has limit usage.

Compared with the v8 (traverse the netdevs in kernel and find the
matched net device), I think don't set netdev is better.

After I explained, how do you think about this? If it still not ok, I
will work on the v8 in the future. Otherwise, I will send a v9 patch.

Thanks,
Cheng Xu
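A minimal sketch of the "GID0 and node GUID are just the MAC" point made
above. The GID layout assumed here is the siw-style convention of the MAC
in the leading bytes of an otherwise zero GID; addrconf_addr_eui48() is
the helper the posted patch already uses for the node GUID, and
dev->attrs.peer_addr is the MAC read from the PCIe BAR. The dev->gid field
is hypothetical.

/* Sketch: both MAC-derived identities, with no netdev reference needed. */
static void erdma_set_mac_identity(struct erdma_dev *dev)
{
	/* GID0: zeroed GID carrying the 6-byte MAC (hypothetical field). */
	memset(&dev->gid, 0, sizeof(dev->gid));
	memcpy(dev->gid.raw, dev->attrs.peer_addr, ETH_ALEN);

	/* Node GUID: EUI-64 expansion of the same MAC. */
	addrconf_addr_eui48((u8 *)&dev->ibdev.node_guid, dev->attrs.peer_addr);
}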
On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
>
> On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
<...>
> > I'm not sure this works for an iWarp device - various things expect to
> > know the netdevice to know how to relate IP addresses to the iWarp
> > stuff - but then I don't really know iWarp.
>
> As far as I know, iWarp device only has one GID entry which generated
> from MAC address.
>
> For iWarp, The CM part in core code resolves address, finds
> route with the help of kernel's net subsystem, and then obtains the
> correct ibdev by GID matching. The GID matching in iWarp is indeed MAC
> address matching.
>
> In another words, for iWarp devices, the core code doesn't handle IP
> addressing related stuff directly, it is finished by calling net APIs.
> The netdev set by ib_device_set_netdev does not used in iWarp's CM
> process.
>
> The binded netdev in iWarp devices, mainly have two purposes:
> 1). generated GID0, using the netdev's mac address.
> 2). get the port state and attributes.
>
> For 1), erdma device binded to net device also by mac address, which can
> be obtained from our PCIe bar registers.
> For 2), erdma can also get the information, and may be more accurately.
> For example, erdma can have different MTU with virtio-net in our cloud.
>
> For RoCEv2, I know that it has many GIDs, some of them are generated
> from IP addresses, and handing IP addressing in core code.

Bernard, Tom what do you think?

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, 18 May 2022 18:32
> To: Cheng Xu <chengyou@linux.alibaba.com>; Bernard Metzler
> <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma
> module
>
> On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
<...>
> > The binded netdev in iWarp devices, mainly have two purposes:
> > 1). generated GID0, using the netdev's mac address.
> > 2). get the port state and attributes.
> >
> > For 1), erdma device binded to net device also by mac address, which can
> > be obtained from our PCIe bar registers.
> > For 2), erdma can also get the information, and may be more accurately.
> > For example, erdma can have different MTU with virtio-net in our cloud.
> >
> > For RoCEv2, I know that it has many GIDs, some of them are generated
> > from IP addresses, and handing IP addressing in core code.
>
> Bernard, Tom what do you think?
>
> Jason

I think iWarp (and now RoCEv2 with its UDP dependency) drivers
produce GIDs mostly to satisfy the current RDMA CM infrastructure,
which depends on this type of unique identifier, inherited from IB.
Imo, more natural would be to implement IP based RDMA protocols
connection management by relying on IP addresses.

Sorry for asking again - why erdma does not need to link with netdev?
Can erdma exist without using a netdev?

Thanks,
Bernard.
On 5/19/2022 12:20 PM, Bernard Metzler wrote:
>
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, 18 May 2022 18:32
<...>
>>> For RoCEv2, I know that it has many GIDs, some of them are generated
>>> from IP addresses, and handing IP addressing in core code.
>>
>> Bernard, Tom what do you think?
>>
>> Jason
>
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, more natural would be to implement IP based RDMA protocols
> connection management by relying on IP addresses.

Agreed. Exposing MAC addresses for this seems so... 20th century.

Tom.

> Sorry for asking again - why erdma does not need to link with netdev?
> Can erdma exist without using a netdev?
>
> Thanks,
> Bernard
On 5/20/22 12:20 AM, Bernard Metzler wrote:
>
<...>
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, more natural would be to implement IP based RDMA protocols
> connection management by relying on IP addresses.
>
> Sorry for asking again - why erdma does not need to link with netdev?
> Can erdma exist without using a netdev?

Actually, erdma also needs a net device to be binded to, and it does
have one. These days I'm trying to find out acceptable ways to get the
reference of the binded netdev, e.g., the 'struct net_device' pointer.

Unlike other RDMA drivers, which can get the reference of their binded
netdevs easily (most RDMA devices are based on the extended aux
devices), it is a little more complex for erdma, because erdma and its
binded net device are two separated PCIe devices.

Then I find that the netdev reference held in the ibdev is rarely used
in core code for iWarp devices; GID0 is the key attribute (as you and
Tom mentioned, it appears with the historical need for compatibility,
but I think this is another story).

So, there are two choices for erdma: enum net devices and find the
matched one, or never calling ib_device_set_netdev. The second one has
less code.

The second way can't work in RoCE. But it works for iWarp (I've tested),
since the netdev reference is rarely used for iWarp in core code, as I
said in the last reply.

In short, the question discussed here is: is it acceptable that an iWarp
driver (which indeed has a netdev binded to it) doesn't hold the netdev
reference in core code? Or is it necessary to call ib_device_set_netdev
to set the binded netdev for an iWarp driver?

You and Tom both are specialists in iWarp, your opinions are important.

Thanks very much,
Cheng Xu

>
> Thanks,
> Bernard.
diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c new file mode 100644 index 000000000000..86c6f574fc68 --- /dev/null +++ b/drivers/infiniband/hw/erdma/erdma_main.c @@ -0,0 +1,625 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB + +/* Authors: Cheng Xu <chengyou@linux.alibaba.com> */ +/* Kai Shen <kaishen@linux.alibaba.com> */ +/* Copyright (c) 2020-2022, Alibaba Group. */ + +#include <linux/errno.h> +#include <linux/init.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/netdevice.h> +#include <linux/pci.h> +#include <net/addrconf.h> +#include <rdma/erdma-abi.h> +#include <rdma/ib_verbs.h> +#include <rdma/ib_user_verbs.h> + +#include "erdma.h" +#include "erdma_cm.h" +#include "erdma_hw.h" +#include "erdma_verbs.h" + +MODULE_AUTHOR("Cheng Xu <chengyou@linux.alibaba.com>"); +MODULE_DESCRIPTION("Alibaba elasticRDMA adapter driver"); +MODULE_LICENSE("Dual BSD/GPL"); + +static int erdma_device_register(struct erdma_dev *dev) +{ + struct ib_device *ibdev = &dev->ibdev; + int ret; + + memset(ibdev->name, 0, IB_DEVICE_NAME_MAX); + /* + * In Ali ECS environment, ENI's mac address is unique in VPC. + * So, generating the ibdev's name from mac address of the binded + * netdev. + */ + ret = snprintf(ibdev->name, IB_DEVICE_NAME_MAX, "%s_%.2x%.2x%.2x", + DRV_MODULE_NAME, dev->attrs.peer_addr[3], + dev->attrs.peer_addr[4], dev->attrs.peer_addr[5]); + if (ret < 0) + return ret; + + ibdev->phys_port_cnt = 1; + ret = ib_device_set_netdev(ibdev, NULL, 1); + if (ret) + return ret; + + ret = ib_register_device(ibdev, ibdev->name, &dev->pdev->dev); + if (ret) + dev_err(&dev->pdev->dev, + "ib_register_device(%s) failed: ret = %d\n", + ibdev->name, ret); + + return ret; +} + +static int erdma_netdev_event(struct notifier_block *nb, unsigned long event, + void *arg) +{ + struct net_device *netdev = netdev_notifier_info_to_dev(arg); + struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb); + + if (dev->netdev == NULL || dev->netdev != netdev) + goto done; + + switch (event) { + case NETDEV_UP: + dev->state = IB_PORT_ACTIVE; + erdma_port_event(dev, IB_EVENT_PORT_ACTIVE); + break; + case NETDEV_DOWN: + dev->state = IB_PORT_DOWN; + erdma_port_event(dev, IB_EVENT_PORT_ERR); + break; + case NETDEV_REGISTER: + case NETDEV_UNREGISTER: + case NETDEV_CHANGEADDR: + case NETDEV_CHANGEMTU: + case NETDEV_GOING_DOWN: + case NETDEV_CHANGE: + default: + break; + } + +done: + return NOTIFY_OK; +} + +static int erdma_newlink(const char *dev_name, struct net_device *netdev) +{ + struct ib_device *ibdev; + struct erdma_dev *dev; + int ret; + + ibdev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_ERDMA); + if (ibdev) { + ib_device_put(ibdev); + return -EEXIST; + } + + ibdev = ib_device_get_by_name(dev_name, RDMA_DRIVER_ERDMA); + if (!ibdev) + return -ENODEV; + + dev = to_edev(ibdev); + + if (!ether_addr_equal_unaligned(netdev->perm_addr, + dev->attrs.peer_addr)) { + ret = -EINVAL; + goto out; + } + + addrconf_addr_eui48((u8 *)&dev->ibdev.node_guid, netdev->dev_addr); + ret = ib_device_set_netdev(&dev->ibdev, netdev, 1); + if (ret) + goto out; + + if (netif_running(netdev) && netif_carrier_ok(netdev)) + dev->state = IB_PORT_ACTIVE; + else + dev->state = IB_PORT_DOWN; + + dev->netdev_nb.notifier_call = erdma_netdev_event; + dev->netdev = netdev; + + ret = register_netdevice_notifier(&dev->netdev_nb); +out: + ib_device_put(ibdev); + return ret; +} + +static struct rdma_link_ops erdma_link_ops = { + .type = "erdma", + 
.newlink = erdma_newlink, +}; + +static irqreturn_t erdma_comm_irq_handler(int irq, void *data) +{ + struct erdma_dev *dev = data; + + erdma_cmdq_completion_handler(&dev->cmdq); + erdma_aeq_event_handler(dev); + + return IRQ_HANDLED; +} + +static void erdma_dwqe_resource_init(struct erdma_dev *dev) +{ + int total_pages, type0, type1; + + dev->attrs.grp_num = erdma_reg_read32(dev, ERDMA_REGS_GRP_NUM_REG); + + if (dev->attrs.grp_num < 4) + dev->attrs.disable_dwqe = true; + else + dev->attrs.disable_dwqe = false; + + /* One page contains 4 goups. */ + total_pages = dev->attrs.grp_num * 4; + + if (dev->attrs.grp_num >= ERDMA_DWQE_MAX_GRP_CNT) { + dev->attrs.grp_num = ERDMA_DWQE_MAX_GRP_CNT; + type0 = ERDMA_DWQE_TYPE0_CNT; + type1 = ERDMA_DWQE_TYPE1_CNT / ERDMA_DWQE_TYPE1_CNT_PER_PAGE; + } else { + type1 = total_pages / 3; + type0 = total_pages - type1 - 1; + } + + dev->attrs.dwqe_pages = type0; + dev->attrs.dwqe_entries = type1 * ERDMA_DWQE_TYPE1_CNT_PER_PAGE; +} + +static int erdma_request_vectors(struct erdma_dev *dev) +{ + int expect_irq_num = min(num_possible_cpus() + 1, ERDMA_NUM_MSIX_VEC); + + dev->attrs.irq_num = pci_alloc_irq_vectors(dev->pdev, 1, expect_irq_num, + PCI_IRQ_MSIX); + if (dev->attrs.irq_num <= 0) { + dev_err(&dev->pdev->dev, "request irq vectors failed(%d)\n", + dev->attrs.irq_num); + return -ENOSPC; + } + + return 0; +} + +static int erdma_comm_irq_init(struct erdma_dev *dev) +{ + snprintf(dev->comm_irq.name, ERDMA_IRQNAME_SIZE, "erdma-common@pci:%s", + pci_name(dev->pdev)); + dev->comm_irq.msix_vector = + pci_irq_vector(dev->pdev, ERDMA_MSIX_VECTOR_CMDQ); + + cpumask_set_cpu(cpumask_first(cpumask_of_pcibus(dev->pdev->bus)), + &dev->comm_irq.affinity_hint_mask); + irq_set_affinity_hint(dev->comm_irq.msix_vector, + &dev->comm_irq.affinity_hint_mask); + + return request_irq(dev->comm_irq.msix_vector, erdma_comm_irq_handler, 0, + dev->comm_irq.name, dev); +} + +static void erdma_comm_irq_uninit(struct erdma_dev *dev) +{ + irq_set_affinity_hint(dev->comm_irq.msix_vector, NULL); + free_irq(dev->comm_irq.msix_vector, dev); +} + +static int erdma_device_init(struct erdma_dev *dev, struct pci_dev *pdev) +{ + erdma_dwqe_resource_init(dev); + + return dma_set_mask_and_coherent(&pdev->dev, + DMA_BIT_MASK(ERDMA_PCI_WIDTH)); +} + +static void erdma_device_uninit(struct erdma_dev *dev) +{ + u32 ctrl = FIELD_PREP(ERDMA_REG_DEV_CTRL_RESET_MASK, 1); + + erdma_reg_write32(dev, ERDMA_REGS_DEV_CTRL_REG, ctrl); +} + +static const struct pci_device_id erdma_pci_tbl[] = { + { PCI_DEVICE(PCI_VENDOR_ID_ALIBABA, 0x107f) }, + {} +}; + +static int erdma_probe_dev(struct pci_dev *pdev) +{ + int err; + struct erdma_dev *dev; + u32 version; + int bars; + + err = pci_enable_device(pdev); + if (err) { + dev_err(&pdev->dev, "pci_enable_device failed(%d)\n", err); + return err; + } + + pci_set_master(pdev); + + dev = ib_alloc_device(erdma_dev, ibdev); + if (!dev) { + dev_err(&pdev->dev, "ib_alloc_device failed\n"); + err = -ENOMEM; + goto err_disable_device; + } + + pci_set_drvdata(pdev, dev); + dev->pdev = pdev; + dev->attrs.numa_node = dev_to_node(&pdev->dev); + + bars = pci_select_bars(pdev, IORESOURCE_MEM); + err = pci_request_selected_regions(pdev, bars, DRV_MODULE_NAME); + if (bars != ERDMA_BAR_MASK || err) { + err = err ? 
err : -EINVAL; + goto err_ib_device_release; + } + + dev->func_bar_addr = pci_resource_start(pdev, ERDMA_FUNC_BAR); + dev->func_bar_len = pci_resource_len(pdev, ERDMA_FUNC_BAR); + + dev->func_bar = + devm_ioremap(&pdev->dev, dev->func_bar_addr, dev->func_bar_len); + if (!dev->func_bar) { + dev_err(&pdev->dev, "devm_ioremap failed.\n"); + err = -EFAULT; + goto err_release_bars; + } + + version = erdma_reg_read32(dev, ERDMA_REGS_VERSION_REG); + if (version == 0) { + /* we knows that it is a non-functional function. */ + err = -ENODEV; + goto err_iounmap_func_bar; + } + + err = erdma_device_init(dev, pdev); + if (err) + goto err_iounmap_func_bar; + + err = erdma_request_vectors(dev); + if (err) + goto err_iounmap_func_bar; + + err = erdma_comm_irq_init(dev); + if (err) + goto err_free_vectors; + + err = erdma_aeq_init(dev); + if (err) + goto err_uninit_comm_irq; + + err = erdma_cmdq_init(dev); + if (err) + goto err_uninit_aeq; + + err = erdma_ceqs_init(dev); + if (err) + goto err_uninit_cmdq; + + erdma_finish_cmdq_init(dev); + + return 0; + +err_uninit_cmdq: + erdma_device_uninit(dev); + erdma_cmdq_destroy(dev); + +err_uninit_aeq: + erdma_aeq_destroy(dev); + +err_uninit_comm_irq: + erdma_comm_irq_uninit(dev); + +err_free_vectors: + pci_free_irq_vectors(dev->pdev); + +err_iounmap_func_bar: + devm_iounmap(&pdev->dev, dev->func_bar); + +err_release_bars: + pci_release_selected_regions(pdev, bars); + +err_ib_device_release: + ib_dealloc_device(&dev->ibdev); + +err_disable_device: + pci_disable_device(pdev); + + return err; +} + +static void erdma_remove_dev(struct pci_dev *pdev) +{ + struct erdma_dev *dev = pci_get_drvdata(pdev); + + erdma_ceqs_uninit(dev); + + erdma_device_uninit(dev); + + erdma_cmdq_destroy(dev); + erdma_aeq_destroy(dev); + erdma_comm_irq_uninit(dev); + pci_free_irq_vectors(dev->pdev); + + devm_iounmap(&pdev->dev, dev->func_bar); + pci_release_selected_regions(pdev, ERDMA_BAR_MASK); + + ib_dealloc_device(&dev->ibdev); + + pci_disable_device(pdev); +} + +#define ERDMA_GET_CAP(name, cap) FIELD_GET(ERDMA_CMD_DEV_CAP_##name##_MASK, cap) + +static int erdma_dev_attrs_init(struct erdma_dev *dev) +{ + int err; + u64 req_hdr, cap0, cap1; + + erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_RDMA, + CMDQ_OPCODE_QUERY_DEVICE); + + err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0, + &cap1); + if (err) + return err; + + dev->attrs.max_cqe = 1 << ERDMA_GET_CAP(MAX_CQE, cap0); + dev->attrs.max_mr_size = 1ULL << ERDMA_GET_CAP(MAX_MR_SIZE, cap0); + dev->attrs.max_mw = 1 << ERDMA_GET_CAP(MAX_MW, cap1); + dev->attrs.max_recv_wr = 1 << ERDMA_GET_CAP(MAX_RECV_WR, cap0); + dev->attrs.local_dma_key = ERDMA_GET_CAP(DMA_LOCAL_KEY, cap1); + dev->attrs.cc = ERDMA_GET_CAP(DEFAULT_CC, cap1); + dev->attrs.max_qp = ERDMA_NQP_PER_QBLOCK * ERDMA_GET_CAP(QBLOCK, cap1); + dev->attrs.max_mr = dev->attrs.max_qp << 1; + dev->attrs.max_cq = dev->attrs.max_qp << 1; + + dev->attrs.max_send_wr = ERDMA_MAX_SEND_WR; + dev->attrs.max_ord = ERDMA_MAX_ORD; + dev->attrs.max_ird = ERDMA_MAX_IRD; + dev->attrs.max_send_sge = ERDMA_MAX_SEND_SGE; + dev->attrs.max_recv_sge = ERDMA_MAX_RECV_SGE; + dev->attrs.max_sge_rd = ERDMA_MAX_SGE_RD; + dev->attrs.max_pd = ERDMA_MAX_PD; + + dev->res_cb[ERDMA_RES_TYPE_PD].max_cap = ERDMA_MAX_PD; + dev->res_cb[ERDMA_RES_TYPE_STAG_IDX].max_cap = dev->attrs.max_mr; + + erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_COMMON, + CMDQ_OPCODE_QUERY_FW_INFO); + + err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0, + &cap1); + + dev->attrs.fw_version = 
FIELD_GET(ERDMA_CMD_INFO0_FW_VER_MASK, cap0); + + return err; +} + +static int erdma_res_cb_init(struct erdma_dev *dev) +{ + int i, j; + + for (i = 0; i < ERDMA_RES_CNT; i++) { + dev->res_cb[i].next_alloc_idx = 1; + spin_lock_init(&dev->res_cb[i].lock); + dev->res_cb[i].bitmap = + kcalloc(BITS_TO_LONGS(dev->res_cb[i].max_cap), + sizeof(unsigned long), GFP_KERNEL); + /* We will free the memory in erdma_res_cb_free */ + if (!dev->res_cb[i].bitmap) + goto err; + } + + return 0; + +err: + for (j = 0; j < i; j++) + kfree(dev->res_cb[j].bitmap); + + return -ENOMEM; +} + +static void erdma_res_cb_free(struct erdma_dev *dev) +{ + int i; + + for (i = 0; i < ERDMA_RES_CNT; i++) + kfree(dev->res_cb[i].bitmap); +} + +static const struct ib_device_ops erdma_device_ops = { + .owner = THIS_MODULE, + .driver_id = RDMA_DRIVER_ERDMA, + .uverbs_abi_ver = ERDMA_ABI_VERSION, + + .alloc_mr = erdma_ib_alloc_mr, + .alloc_pd = erdma_alloc_pd, + .alloc_ucontext = erdma_alloc_ucontext, + .create_cq = erdma_create_cq, + .create_qp = erdma_create_qp, + .dealloc_pd = erdma_dealloc_pd, + .dealloc_ucontext = erdma_dealloc_ucontext, + .dereg_mr = erdma_dereg_mr, + .destroy_cq = erdma_destroy_cq, + .destroy_qp = erdma_destroy_qp, + .get_dma_mr = erdma_get_dma_mr, + .get_port_immutable = erdma_get_port_immutable, + .iw_accept = erdma_accept, + .iw_add_ref = erdma_qp_get_ref, + .iw_connect = erdma_connect, + .iw_create_listen = erdma_create_listen, + .iw_destroy_listen = erdma_destroy_listen, + .iw_get_qp = erdma_get_ibqp, + .iw_reject = erdma_reject, + .iw_rem_ref = erdma_qp_put_ref, + .map_mr_sg = erdma_map_mr_sg, + .mmap = erdma_mmap, + .mmap_free = erdma_mmap_free, + .modify_qp = erdma_modify_qp, + .post_recv = erdma_post_recv, + .post_send = erdma_post_send, + .poll_cq = erdma_poll_cq, + .query_device = erdma_query_device, + .query_gid = erdma_query_gid, + .query_port = erdma_query_port, + .query_qp = erdma_query_qp, + .req_notify_cq = erdma_req_notify_cq, + .reg_user_mr = erdma_reg_user_mr, + + INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq), + INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd), + INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext), + INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp), +}; + +static int erdma_ib_device_add(struct pci_dev *pdev) +{ + struct erdma_dev *dev = pci_get_drvdata(pdev); + struct ib_device *ibdev = &dev->ibdev; + u64 mac; + int ret = 0; + + ret = erdma_dev_attrs_init(dev); + if (ret) + return ret; + + ibdev->node_type = RDMA_NODE_RNIC; + memcpy(ibdev->node_desc, ERDMA_NODE_DESC, sizeof(ERDMA_NODE_DESC)); + + /* + * Current model (one-to-one device association): + * One ERDMA device per net_device or, equivalently, + * per physical port. 
+ */ + ibdev->phys_port_cnt = 1; + ibdev->num_comp_vectors = dev->attrs.irq_num - 1; + + ib_set_device_ops(ibdev, &erdma_device_ops); + + INIT_LIST_HEAD(&dev->cep_list); + + spin_lock_init(&dev->lock); + xa_init_flags(&dev->qp_xa, XA_FLAGS_ALLOC1); + xa_init_flags(&dev->cq_xa, XA_FLAGS_ALLOC1); + dev->next_alloc_cqn = 1; + dev->next_alloc_qpn = 1; + + ret = erdma_res_cb_init(dev); + if (ret) + return ret; + + spin_lock_init(&dev->db_bitmap_lock); + bitmap_zero(dev->sdb_page, ERDMA_DWQE_TYPE0_CNT); + bitmap_zero(dev->sdb_entry, ERDMA_DWQE_TYPE1_CNT); + + atomic_set(&dev->num_ctx, 0); + + mac = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG); + mac |= (u64)erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG) << 32; + + u64_to_ether_addr(mac, dev->attrs.peer_addr); + + ret = erdma_device_register(dev); + if (ret) + goto err_out; + + return 0; + +err_out: + xa_destroy(&dev->qp_xa); + xa_destroy(&dev->cq_xa); + + erdma_res_cb_free(dev); + + return ret; +} + +static void erdma_ib_device_remove(struct pci_dev *pdev) +{ + struct erdma_dev *dev = pci_get_drvdata(pdev); + + if (dev->netdev) { + ib_device_set_netdev(&dev->ibdev, NULL, 1); + dev->netdev = NULL; + unregister_netdevice_notifier(&dev->netdev_nb); + } + + ib_unregister_device(&dev->ibdev); + + erdma_res_cb_free(dev); + xa_destroy(&dev->qp_xa); + xa_destroy(&dev->cq_xa); +} + +static int erdma_probe(struct pci_dev *pdev, const struct pci_device_id *ent) +{ + int ret; + + ret = erdma_probe_dev(pdev); + if (ret) + return ret; + + ret = erdma_ib_device_add(pdev); + if (ret) { + erdma_remove_dev(pdev); + return ret; + } + + return 0; +} + +static void erdma_remove(struct pci_dev *pdev) +{ + erdma_ib_device_remove(pdev); + erdma_remove_dev(pdev); +} + +static struct pci_driver erdma_pci_driver = { + .name = DRV_MODULE_NAME, + .id_table = erdma_pci_tbl, + .probe = erdma_probe, + .remove = erdma_remove +}; + +MODULE_DEVICE_TABLE(pci, erdma_pci_tbl); + +static __init int erdma_init_module(void) +{ + int ret; + + ret = erdma_cm_init(); + if (ret) + return ret; + + ret = pci_register_driver(&erdma_pci_driver); + if (ret) { + erdma_cm_exit(); + return ret; + } + + rdma_link_register(&erdma_link_ops); + + return 0; +} + +static void __exit erdma_exit_module(void) +{ + rdma_link_unregister(&erdma_link_ops); + + pci_unregister_driver(&erdma_pci_driver); + + erdma_cm_exit(); +} + +module_init(erdma_init_module); +module_exit(erdma_exit_module);
Add the main erdma module, which provides the interface to the
infiniband subsystem. Also fix a bug reported by the kernel test robot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_main.c | 625 +++++++++++++++++++++++
 1 file changed, 625 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c