[for-next,v7,10/12] RDMA/erdma: Add the erdma module

Message ID 20220421071747.1892-11-chengyou@linux.alibaba.com (mailing list archive)
State Superseded
Delegated to: Jason Gunthorpe
Series Elastic RDMA Adapter (ERDMA) driver

Commit Message

Cheng Xu April 21, 2022, 7:17 a.m. UTC
Add the main erdma module, which provides the interface to the
infiniband subsystem. Also fix a bug reported by the kernel test robot.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_main.c | 625 +++++++++++++++++++++++
 1 file changed, 625 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c

Comments

Jason Gunthorpe May 10, 2022, 1:17 p.m. UTC | #1
On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:

> +static struct rdma_link_ops erdma_link_ops = {
> +	.type = "erdma",
> +	.newlink = erdma_newlink,
> +};

Why is there still a newlink?

Jason
Cheng Xu May 16, 2022, 3:15 a.m. UTC | #2
On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> 
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
> 
> Why is there still a newlink?
> 
> Jason


Yeah, I remember your suggestion that the ibdev should share the same
lifecycle as its PCI device, and it does now.

The initialization flow for erdma is now:
      probe:
        - Hardware-specific initialization
        - IB device initialization
        - Calling ib_register_device with ibdev->netdev == NULL

After probe, the ib device has been registered, but it is left in an
invalid state. To fully complete the initialization, we must set
ibdev->netdev.

The newlink command in erdma now does only one thing: it sets
ibdev->netdev if the matching rule succeeds, and it is decoupled from
the ib device lifecycle.

I think this may be the best way to link the netdev to the ibdev for
erdma. Using a netdev notifier is not good, because we would have to
register a module-wide netdev notifier.

If this approach is not proper, or there is a better way, please let me
know.

Thanks,
Cheng Xu
Jason Gunthorpe May 16, 2022, 11:49 a.m. UTC | #3
On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
> 
> 
> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > 
> > > +static struct rdma_link_ops erdma_link_ops = {
> > > +	.type = "erdma",
> > > +	.newlink = erdma_newlink,
> > > +};
> > 
> > Why is there still a newlink?
> > 
> > Jason
> 
> 
> Yeah, I remember your suggestion that the ibdev should keep the same
> lifecycle with its PCI device, and so does it now.
> 
> The initialization flow for erdma now:
>      probe:
>        - Hardware specified initialization
>        - IB device initialization
>        - Calling ib_register_device with ibdev->netdev == NULL
> 
> After probe, The ib device has been registered, but we left it in
> invalid state.
> To fully complete the initialization, we should set the ibdev->netdev.
> 
> And the newlink command in erdma only do one thing now: set the
> ibdev->netdev if the rule matches, and it is uncorrelated with the
> ib device lifecycle.

This is not what newlink is for, it can't be used like this

Jason
Cheng Xu May 16, 2022, 12:37 p.m. UTC | #4
On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>> Jason
>>
>>
>> Yeah, I remember your suggestion that the ibdev should keep the same
>> lifecycle with its PCI device, and so does it now.
>>
>> The initialization flow for erdma now:
>>       probe:
>>         - Hardware specified initialization
>>         - IB device initialization
>>         - Calling ib_register_device with ibdev->netdev == NULL
>>
>> After probe, The ib device has been registered, but we left it in
>> invalid state.
>> To fully complete the initialization, we should set the ibdev->netdev.
>>
>> And the newlink command in erdma only do one thing now: set the
>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>> ib device lifecycle.
> 
> This is not what newlink is for, it can't be used like this
> 
> Jason

Yeah, newlink is only used by siw and rxe to control their ib
device lifecycle now.

As for the usage of newlink, it is triggered by the 'rdma link add'
command. In my opinion, the 'rdma link' command set is all about
displaying (rdma link show) or managing (rdma link add/del) the link
relationship between an ib device and a net device. And what erdma
needs is to set up the link between its ib device and a net device.

So, although the current usage is controlling the lifecycle of siw/rxe,
given the command's naming and connotation, would it be OK to expand
the usage of newlink to meet our requirement?

Or, is it OK for erdma to register a net notifier in our module to
handle this?

Thanks,
Cheng Xu
Cheng Xu May 16, 2022, 12:40 p.m. UTC | #5
On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>> Jason
>>
>>
>> Yeah, I remember your suggestion that the ibdev should keep the same
>> lifecycle with its PCI device, and so does it now.
>>
>> The initialization flow for erdma now:
>>       probe:
>>         - Hardware specified initialization
>>         - IB device initialization
>>         - Calling ib_register_device with ibdev->netdev == NULL
>>
>> After probe, The ib device has been registered, but we left it in
>> invalid state.
>> To fully complete the initialization, we should set the ibdev->netdev.
>>
>> And the newlink command in erdma only do one thing now: set the
>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>> ib device lifecycle.
> 
> This is not what newlink is for, it can't be used like this
> 
> Jason


[A typo in the previous reply, so I re-edited it]

Yeah, newlink is only used by siw and rxe to control their ib
device lifecycle now.

As for the usage of newlink, it is triggered by the 'rdma link add'
command. In my opinion, the 'rdma link' command set is all about
displaying (rdma link show) or managing (rdma link add/del) the link
relationship between an ib device and a net device. And what erdma
needs is to set up the link between its ib device and a net device.

So, although the current usage is controlling the lifecycle of siw/rxe,
given the command's naming and connotation, would it be OK to expand
the usage of newlink to meet our requirement?

Or, is it OK for erdma to register a net notifier in our module to
handle this?

Thanks,
Cheng Xu
Cheng Xu May 16, 2022, 1:59 p.m. UTC | #6
On 5/16/22 8:40 PM, Cheng Xu wrote:
> 
> 
> On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
>> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>>
>>>
>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>
>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>> +    .type = "erdma",
>>>>> +    .newlink = erdma_newlink,
>>>>> +};
>>>>
>>>> Why is there still a newlink?
>>>>
>>>> Jason
>>>
>>>
>>> Yeah, I remember your suggestion that the ibdev should keep the same
>>> lifecycle with its PCI device, and so does it now.
>>>
>>> The initialization flow for erdma now:
>>>       probe:
>>>         - Hardware specified initialization
>>>         - IB device initialization
>>>         - Calling ib_register_device with ibdev->netdev == NULL
>>>
>>> After probe, The ib device has been registered, but we left it in
>>> invalid state.
>>> To fully complete the initialization, we should set the ibdev->netdev.
>>>
>>> And the newlink command in erdma only do one thing now: set the
>>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>>> ib device lifecycle.
>>
>> This is not what newlink is for, it can't be used like this
>>
>> Jason
> 
> 
> [A typo in previous reply, so I re-edit it]
> 
<...>
> Or, Is it ok that erdma registers a net notifier in our module to handle
> this?
> 

I thought about this twice: a module-level net notifier cannot solve
this, due to the lack of information about all the ib devices in the
erdma module.

The solution may be simple: register a per-device net notifier in
probe, and call ib_device_set_netdev from the notifier callback when
the netdev matches.
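
Roughly like this (just a sketch: the notifier would be registered from
probe with register_netdevice_notifier(&dev->netdev_nb); unregistration
and error handling are omitted):

static int erdma_netdev_match_event(struct notifier_block *nb,
				    unsigned long event, void *arg)
{
	struct net_device *netdev = netdev_notifier_info_to_dev(arg);
	struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb);

	/* Bind the first netdev whose permanent MAC matches the adapter. */
	if (event == NETDEV_REGISTER && !dev->netdev &&
	    ether_addr_equal_unaligned(netdev->perm_addr,
				       dev->attrs.peer_addr)) {
		dev->netdev = netdev;
		ib_device_set_netdev(&dev->ibdev, netdev, 1);
	}

	return NOTIFY_OK;
}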

Would this be the right way?

Thanks,
Cheng Xu
Jason Gunthorpe May 16, 2022, 2:07 p.m. UTC | #7
On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
> 
> 
> On 5/16/22 8:40 PM, Cheng Xu wrote:
> > 
> > 
> > On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
> > > On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
> > > > 
> > > > 
> > > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > > 
> > > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > > +    .type = "erdma",
> > > > > > +    .newlink = erdma_newlink,
> > > > > > +};
> > > > > 
> > > > > Why is there still a newlink?
> > > > > 
> > > > > Jason
> > > > 
> > > > 
> > > > Yeah, I remember your suggestion that the ibdev should keep the same
> > > > lifecycle with its PCI device, and so does it now.
> > > > 
> > > > The initialization flow for erdma now:
> > > >       probe:
> > > >         - Hardware specified initialization
> > > >         - IB device initialization
> > > >         - Calling ib_register_device with ibdev->netdev == NULL
> > > > 
> > > > After probe, The ib device has been registered, but we left it in
> > > > invalid state.
> > > > To fully complete the initialization, we should set the ibdev->netdev.
> > > > 
> > > > And the newlink command in erdma only do one thing now: set the
> > > > ibdev->netdev if the rule matches, and it is uncorrelated with the
> > > > ib device lifecycle.
> > > 
> > > This is not what newlink is for, it can't be used like this
> > > 
> > > Jason
> > 
> > 
> > [A typo in previous reply, so I re-edit it]
> > 
> <...>
> > Or, Is it ok that erdma registers a net notifier in our module to handle
> > this?
> > 
> 
> I think this twice, use a net notifier of module can not solve this, due
> to the leak of all ib devices information in erdma module.

I don't understand what this means

> The solution may be simple: register net notifier per device in probe,
> and call ib_device_set_netdev if matches in the notifier callback.

That sounds wrong

Jason
Cheng Xu May 16, 2022, 3:14 p.m. UTC | #8
On 5/16/22 10:07 PM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 09:59:28PM +0800, Cheng Xu wrote:
>>
>>
>> On 5/16/22 8:40 PM, Cheng Xu wrote:
>>>
>>>
>>> On 5/16/22 7:49 PM, Jason Gunthorpe wrote:
>>>> On Mon, May 16, 2022 at 11:15:32AM +0800, Cheng Xu wrote:
>>>>>
>>>>>
>>>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>>>
>>>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>>>> +    .type = "erdma",
>>>>>>> +    .newlink = erdma_newlink,
>>>>>>> +};
>>>>>>
>>>>>> Why is there still a newlink?
>>>>>>
>>>>>> Jason
>>>>>
>>>>>
>>>>> Yeah, I remember your suggestion that the ibdev should keep the same
>>>>> lifecycle with its PCI device, and so does it now.
>>>>>
>>>>> The initialization flow for erdma now:
>>>>>        probe:
>>>>>          - Hardware specified initialization
>>>>>          - IB device initialization
>>>>>          - Calling ib_register_device with ibdev->netdev == NULL
>>>>>
>>>>> After probe, The ib device has been registered, but we left it in
>>>>> invalid state.
>>>>> To fully complete the initialization, we should set the ibdev->netdev.
>>>>>
>>>>> And the newlink command in erdma only do one thing now: set the
>>>>> ibdev->netdev if the rule matches, and it is uncorrelated with the
>>>>> ib device lifecycle.
>>>>
>>>> This is not what newlink is for, it can't be used like this
>>>>
>>>> Jason
>>>
>>>
>>> [A typo in previous reply, so I re-edit it]
>>>
>> <...>
>>> Or, Is it ok that erdma registers a net notifier in our module to handle
>>> this?
>>>
>>
>> I think this twice, use a net notifier of module can not solve this, due
>> to the leak of all ib devices information in erdma module.
> 
> I don't understand what this means

At first I wanted to register a global net notifier in our module and
link the ib device and net device in the callback when they match. But
in the callback we cannot get the ib devices registered by erdma,
because our driver module does not maintain an ib device list. This is
not the right way.

>> The solution may be simple: register net notifier per device in probe,
>> and call ib_device_set_netdev if matches in the notifier callback.
> 
> That sounds wrong
> 

OK. So is using the rdma link command to handle the link relationship
manually still a better choice?

Actually, erdma looks like a hardware implementation of siw/rxe (more
like siw, because erdma uses iWarp and has a similar CM
implementation): the ethernet functionality is provided by an existing
net device, whatever kind of net device it is (since virtio-net is the
de facto choice in cloud virtualization, erdma is bound to virtio-net
in practice). Given this similarity, siw/rxe use 'rdma link' to link
the ib device and the net device (including creating the ib device).
Does it sound reasonable for erdma to use it to link the net device?

Could you reconsider this approach, or give us some suggestions? I
think this is the main and most important issue for our upstreaming. I
would be very grateful.

Thanks,
Cheng Xu
Jason Gunthorpe May 16, 2022, 5:31 p.m. UTC | #9
On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
> > > I think this twice, use a net notifier of module can not solve this, due
> > > to the leak of all ib devices information in erdma module.
> > 
> > I don't understand what this means
> At first I want to register a global net notifier in our module, and link
> the ib device and net device if matches in the callback. But in the
> callback, we can not get the ib devices registered by erdma, because our
> driver module does not maintain the ib device list. This is not the
> right way.

The core has the device list; if you need to search it then you need
to add a new searching function

> 
> > > The solution may be simple: register net notifier per device in probe,
> > > and call ib_device_set_netdev if matches in the notifier callback.
> > 
> > That sounds wrong
> > 
> 
> OK. So using rdma link command to handle the link relationship manually
> is still a better choice?

No
 
> Actually, erdma looks like a hardware implementation of siw/rxe (more
> likes siw, because erdma uses iWarp, and have similar CM
> implementation): the ethernet functionality is provided by existed net
> device, no matter what kind of the net device (due to virtio-net is
> de-facto in cloud virtualization, erdma is binded to virtio-net in
> practice). Due to this similarity, siw/rxe use 'rdma link' to link the
> ib device and net device (including create ib device), Does it sound
> reasonable that erdma use it to link the net deivce?

It is not like siw because it cannot run on any netdevice; it can only
run on the single netdevice the HW is linked to in the background.

Jason
Cheng Xu May 17, 2022, 1:53 a.m. UTC | #10
On 5/17/22 1:31 AM, Jason Gunthorpe wrote:
> On Mon, May 16, 2022 at 11:14:24PM +0800, Cheng Xu wrote:
>>>> I think this twice, use a net notifier of module can not solve this, due
>>>> to the leak of all ib devices information in erdma module.
>>>
>>> I don't understand what this means
>> At first I want to register a global net notifier in our module, and link
>> the ib device and net device if matches in the callback. But in the
>> callback, we can not get the ib devices registered by erdma, because our
>> driver module does not maintain the ib device list. This is not the
>> right way.
> 
> The core has the device list, if you need to search it then you need
> to add a new searching function
> 

I made a mistake: instead of what I said, erdma should search the net
devices from a workqueue to find the matching net device, not search
the ib devices in a net notifier callback. See the sketch below.

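For example (a rough sketch only: the netdev_work field and the work
item are hypothetical, the other names follow the patch below):

static void erdma_netdev_search_work(struct work_struct *work)
{
	struct erdma_dev *dev =
		container_of(work, struct erdma_dev, netdev_work);
	struct net_device *netdev;

	/* Scan the registered netdevs for the one whose permanent MAC
	 * matches the adapter MAC read from the BAR registers.
	 */
	rtnl_lock();
	for_each_netdev(&init_net, netdev) {
		if (ether_addr_equal_unaligned(netdev->perm_addr,
					       dev->attrs.peer_addr)) {
			dev->netdev = netdev;
			ib_device_set_netdev(&dev->ibdev, netdev, 1);
			break;
		}
	}
	rtnl_unlock();
}
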
I think that should be fine.

>>
>>>> The solution may be simple: register net notifier per device in probe,
>>>> and call ib_device_set_netdev if matches in the notifier callback.
>>>
>>> That sounds wrong
>>>
>>
>> OK. So using rdma link command to handle the link relationship manually
>> is still a better choice?
> 
> No
>   
>> Actually, erdma looks like a hardware implementation of siw/rxe (more
>> likes siw, because erdma uses iWarp, and have similar CM
>> implementation): the ethernet functionality is provided by existed net
>> device, no matter what kind of the net device (due to virtio-net is
>> de-facto in cloud virtualization, erdma is binded to virtio-net in
>> practice). Due to this similarity, siw/rxe use 'rdma link' to link the
>> ib device and net device (including create ib device), Does it sound
>> reasonable that erdma use it to link the net deivce?
> 
> It is not like siw because it cannot run on any netdevice, it can only
> run on the signle netdevice the HW is linked to in the background.
> 

Yeah, you are exactly right; I omitted this detail in my last
reply.

Thanks,
Cheng Xu

> Jason
Cheng Xu May 18, 2022, 8:30 a.m. UTC | #11
On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> 
>> +static struct rdma_link_ops erdma_link_ops = {
>> +	.type = "erdma",
>> +	.newlink = erdma_newlink,
>> +};
> 
> Why is there still a newlink?
> 

Hello, Jason,

About this issue, I have another idea that is simpler and more reasonable.

Maybe the erdma driver doesn't need to link to a net device in the
kernel. In the core code, ib_device_get_netdev has several use cases:

   1). query port info in netlink
   2). get eth speed for IB (ib_get_eth_speed)
   3). enumerate all RoCE ports (ib_enum_roce_netdev)
   4). iw_query_port

The only case related to erdma is 4), and we change it in our patch
02/12. So it seems all right that we do not link erdma to a net device.

* I also tested this solution; it works for both perftest and NoF. *

Another issue is how to get the port state and attributes without a
net device. For this, erdma can get them from the HW directly, as
sketched below.

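Something like this, for example (only a sketch: ERDMA_REGS_PORT_STATE_REG
is a hypothetical register name; erdma_reg_read32 is from the patch below):

static enum ib_port_state erdma_hw_port_state(struct erdma_dev *dev)
{
	/* Read the link state from the device instead of a bound netdev. */
	u32 state = erdma_reg_read32(dev, ERDMA_REGS_PORT_STATE_REG);

	return state ? IB_PORT_ACTIVE : IB_PORT_DOWN;
}
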
So, I think this may be the final solution. (BTW, I have gone over the
rdma drivers; EFA does it this way, and it also has two separate
devices for net and rdma. It inspired me.)

Thanks,
Cheng Xu
Jason Gunthorpe May 18, 2022, 2:46 p.m. UTC | #12
On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> 
> 
> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > 
> > > +static struct rdma_link_ops erdma_link_ops = {
> > > +	.type = "erdma",
> > > +	.newlink = erdma_newlink,
> > > +};
> > 
> > Why is there still a newlink?
> > 
> 
> Hello, Jason,
> 
> About this issue, I have another idea, more simple and reasonable.
> 
> Maybe erdma driver doesn't need to link to a net device in kernel. In
> the core code, the ib_device_get_netdev has several use cases:
> 
>   1). query port info in netlink
>   2). get eth speed for IB (ib_get_eth_speed)
>   3). enumerate all RoCE ports (ib_enum_roce_netdev)
>   4). iw_query_port
> 
> The cases related to erdma is 4). But we change it in our patch 02/12.
> So, it seems all right that we do not link erdma to a net device.
> 
> * I also test this solution, it works for both perftest and NoF. *
> 
> Another issue is how to get the port state and attributes without
> net device. For this, erdma can get it from HW directly.
> 
> So, I think this may be the final solution. (BTW, I have gone over
> the rdma drivers, EFA does in this way, it also has two separated
> devices for net and rdma. It inspired me).

I'm not sure this works for an iWarp device - various things expect to
know the netdevice to know how to relate IP addresses to the iWarp
stuff - but then I don't really know iWarp.

If it works at all it is not a great idea.

EFA gets by because it doesn't use IP addressing at all.

Jason
Cheng Xu May 18, 2022, 4:24 p.m. UTC | #13
On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>>
>>
>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>
>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>> +	.type = "erdma",
>>>> +	.newlink = erdma_newlink,
>>>> +};
>>>
>>> Why is there still a newlink?
>>>
>>
>> Hello, Jason,
>>
>> About this issue, I have another idea, more simple and reasonable.
>>
>> Maybe erdma driver doesn't need to link to a net device in kernel. In
>> the core code, the ib_device_get_netdev has several use cases:
>>
>>    1). query port info in netlink
>>    2). get eth speed for IB (ib_get_eth_speed)
>>    3). enumerate all RoCE ports (ib_enum_roce_netdev)
>>    4). iw_query_port
>>
>> The cases related to erdma is 4). But we change it in our patch 02/12.
>> So, it seems all right that we do not link erdma to a net device.
>>
>> * I also test this solution, it works for both perftest and NoF. *
>>
>> Another issue is how to get the port state and attributes without
>> net device. For this, erdma can get it from HW directly.
>>
>> So, I think this may be the final solution. (BTW, I have gone over
>> the rdma drivers, EFA does in this way, it also has two separated
>> devices for net and rdma. It inspired me).
> 
> I'm not sure this works for an iWarp device - various things expect to
> know the netdevice to know how to relate IP addresses to the iWarp
> stuff - but then I don't really know iWarp.

As far as I know, an iWarp device has only one GID entry, which is
generated from the MAC address.

For iWarp, the CM part of the core code resolves the address and finds
the route with the help of the kernel's net subsystem, and then obtains
the correct ibdev by GID matching. GID matching in iWarp is really MAC
address matching.

In other words, for iWarp devices the core code doesn't handle
IP-addressing-related work directly; it is done by calling net APIs.
The netdev set by ib_device_set_netdev is not used in iWarp's CM
process.

The bound netdev in an iWarp device mainly has two purposes:
   1). generating GID0, using the netdev's MAC address.
   2). getting the port state and attributes.

For 1), the erdma device is also bound to its net device by MAC
address, which can be obtained from our PCIe BAR registers (see the
sketch below).
For 2), erdma can also get the information, and perhaps more
accurately. For example, erdma can have a different MTU from virtio-net
in our cloud.
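
For illustration, 1) could look roughly like this (a sketch only: the
MAC registers and u64_to_ether_addr usage are from the patch below, and
the GID0-equals-MAC convention follows siw):

static void erdma_fill_gid0(struct erdma_dev *dev, union ib_gid *gid)
{
	/* The MAC lives in BAR registers, so no netdev is needed. */
	u64 mac = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG);

	mac |= (u64)erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG) << 32;

	memset(gid, 0, sizeof(*gid));
	u64_to_ether_addr(mac, gid->raw);
}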

For RoCEv2, I know that it has many GIDs; some of them are generated
from IP addresses, and IP addressing is handled in the core code.

> If it works at all it is not a great idea
As I explained above (I hope I explained it clearly, but I am a little
worried about my English :) ), the netdev bound by ib_device_set_netdev
has limited usage for an iWarp device. Compared with the v8 approach
(traverse the netdevs in the kernel and find the matching net device),
I think not setting the netdev is better.

Given this explanation, what do you think? If it is still not OK, I
will work on the v8 approach in the future. Otherwise, I will send a
v9 patch.

Thanks,
Cheng Xu
Jason Gunthorpe May 18, 2022, 4:31 p.m. UTC | #14
On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
> 
> 
> On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> > On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> > > 
> > > 
> > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > 
> > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > +	.type = "erdma",
> > > > > +	.newlink = erdma_newlink,
> > > > > +};
> > > > 
> > > > Why is there still a newlink?
> > > > 
> > > 
> > > Hello, Jason,
> > > 
> > > About this issue, I have another idea, more simple and reasonable.
> > > 
> > > Maybe erdma driver doesn't need to link to a net device in kernel. In
> > > the core code, the ib_device_get_netdev has several use cases:
> > > 
> > >    1). query port info in netlink
> > >    2). get eth speed for IB (ib_get_eth_speed)
> > >    3). enumerate all RoCE ports (ib_enum_roce_netdev)
> > >    4). iw_query_port
> > > 
> > > The cases related to erdma is 4). But we change it in our patch 02/12.
> > > So, it seems all right that we do not link erdma to a net device.
> > > 
> > > * I also test this solution, it works for both perftest and NoF. *
> > > 
> > > Another issue is how to get the port state and attributes without
> > > net device. For this, erdma can get it from HW directly.
> > > 
> > > So, I think this may be the final solution. (BTW, I have gone over
> > > the rdma drivers, EFA does in this way, it also has two separated
> > > devices for net and rdma. It inspired me).
> > 
> > I'm not sure this works for an iWarp device - various things expect to
> > know the netdevice to know how to relate IP addresses to the iWarp
> > stuff - but then I don't really know iWarp.
> 
> As far as I know, iWarp device only has one GID entry which generated
> from MAC address.
> 
> For iWarp, The CM part in core code resolves address, finds
> route with the help of kernel's net subsystem, and then obtains the correct
> ibdev by GID matching. The GID matching in iWarp is indeed MAC address
> matching.
> 
> In another words, for iWarp devices, the core code doesn't handle IP
> addressing related stuff directly, it is finished by calling net APIs.
> The netdev set by ib_device_set_netdev does not used in iWarp's CM
> process.
> 
> The binded netdev in iWarp devices, mainly have two purposes:
>   1). generated GID0, using the netdev's mac address.
>   2). get the port state and attributes.
> 
> For 1), erdma device binded to net device also by mac address, which can
> be obtained from our PCIe bar registers.
> For 2), erdma can also get the information, and may be more accurately.
> For example, erdma can have different MTU with virtio-net in our cloud.
> 
> For RoCEv2, I know that it has many GIDs, some of them are generated
> from IP addresses, and handing IP addressing in core code.

Bernard, Tom, what do you think?

Jason
Bernard Metzler May 19, 2022, 4:20 p.m. UTC | #15
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, 18 May 2022 18:32
> To: Cheng Xu <chengyou@linux.alibaba.com>; Bernard Metzler
> <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>
> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma
> module
> 
> On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
> >
> >
> > On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
> > > On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
> > > >
> > > >
> > > > On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
> > > > >
> > > > > > +static struct rdma_link_ops erdma_link_ops = {
> > > > > > +	.type = "erdma",
> > > > > > +	.newlink = erdma_newlink,
> > > > > > +};
> > > > >
> > > > > Why is there still a newlink?
> > > > >
> > > >
> > > > Hello, Jason,
> > > >
> > > > About this issue, I have another idea, more simple and reasonable.
> > > >
> > > > Maybe erdma driver doesn't need to link to a net device in kernel. In
> > > > the core code, the ib_device_get_netdev has several use cases:
> > > >
> > > >    1). query port info in netlink
> > > >    2). get eth speed for IB (ib_get_eth_speed)
> > > >    3). enumerate all RoCE ports (ib_enum_roce_netdev)
> > > >    4). iw_query_port
> > > >
> > > > The cases related to erdma is 4). But we change it in our patch
> 02/12.
> > > > So, it seems all right that we do not link erdma to a net device.
> > > >
> > > > * I also test this solution, it works for both perftest and NoF. *
> > > >
> > > > Another issue is how to get the port state and attributes without
> > > > net device. For this, erdma can get it from HW directly.
> > > >
> > > > So, I think this may be the final solution. (BTW, I have gone over
> > > > the rdma drivers, EFA does in this way, it also has two separated
> > > > devices for net and rdma. It inspired me).
> > >
> > > I'm not sure this works for an iWarp device - various things expect to
> > > know the netdevice to know how to relate IP addresses to the iWarp
> > > stuff - but then I don't really know iWarp.
> >
> > As far as I know, iWarp device only has one GID entry which generated
> > from MAC address.
> >
> > For iWarp, The CM part in core code resolves address, finds
> > route with the help of kernel's net subsystem, and then obtains the
> correct
> > ibdev by GID matching. The GID matching in iWarp is indeed MAC address
> > matching.
> >
> > In another words, for iWarp devices, the core code doesn't handle IP
> > addressing related stuff directly, it is finished by calling net APIs.
> > The netdev set by ib_device_set_netdev does not used in iWarp's CM
> > process.
> >
> > The binded netdev in iWarp devices, mainly have two purposes:
> >   1). generated GID0, using the netdev's mac address.
> >   2). get the port state and attributes.
> >
> > For 1), erdma device binded to net device also by mac address, which can
> > be obtained from our PCIe bar registers.
> > For 2), erdma can also get the information, and may be more accurately.
> > For example, erdma can have different MTU with virtio-net in our cloud.
> >
> > For RoCEv2, I know that it has many GIDs, some of them are generated
> > from IP addresses, and handing IP addressing in core code.
> 
> Bernard, Tom what do you think?
> 
> Jason

I think iWarp (and now RoCEv2 with its UDP dependency) drivers
produce GIDs mostly to satisfy the current RDMA CM infrastructure,
which depends on this type of unique identifier, inherited from IB.
Imo, it would be more natural to implement connection management for
IP-based RDMA protocols by relying on IP addresses.

Sorry for asking again - why does erdma not need to link with a netdev?
Can erdma exist without using a netdev?


Thanks,
Bernard.
Tom Talpey May 19, 2022, 6:51 p.m. UTC | #16
On 5/19/2022 12:20 PM, Bernard Metzler wrote:
> 
> 
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, 18 May 2022 18:32
>> To: Cheng Xu <chengyou@linux.alibaba.com>; Bernard Metzler
>> <BMT@zurich.ibm.com>; Tom Talpey <tom@talpey.com>
>> Cc: dledford@redhat.com; leon@kernel.org; linux-rdma@vger.kernel.org;
>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>> Subject: [EXTERNAL] Re: [PATCH for-next v7 10/12] RDMA/erdma: Add the erdma
>> module
>>
>> On Thu, May 19, 2022 at 12:24:22AM +0800, Cheng Xu wrote:
>>>
>>>
>>> On 5/18/22 10:46 PM, Jason Gunthorpe wrote:
>>>> On Wed, May 18, 2022 at 04:30:33PM +0800, Cheng Xu wrote:
>>>>>
>>>>>
>>>>> On 5/10/22 9:17 PM, Jason Gunthorpe wrote:
>>>>>> On Thu, Apr 21, 2022 at 03:17:45PM +0800, Cheng Xu wrote:
>>>>>>
>>>>>>> +static struct rdma_link_ops erdma_link_ops = {
>>>>>>> +	.type = "erdma",
>>>>>>> +	.newlink = erdma_newlink,
>>>>>>> +};
>>>>>>
>>>>>> Why is there still a newlink?
>>>>>>
>>>>>
>>>>> Hello, Jason,
>>>>>
>>>>> About this issue, I have another idea, more simple and reasonable.
>>>>>
>>>>> Maybe erdma driver doesn't need to link to a net device in kernel. In
>>>>> the core code, the ib_device_get_netdev has several use cases:
>>>>>
>>>>>     1). query port info in netlink
>>>>>     2). get eth speed for IB (ib_get_eth_speed)
>>>>>     3). enumerate all RoCE ports (ib_enum_roce_netdev)
>>>>>     4). iw_query_port
>>>>>
>>>>> The cases related to erdma is 4). But we change it in our patch
>> 02/12.
>>>>> So, it seems all right that we do not link erdma to a net device.
>>>>>
>>>>> * I also test this solution, it works for both perftest and NoF. *
>>>>>
>>>>> Another issue is how to get the port state and attributes without
>>>>> net device. For this, erdma can get it from HW directly.
>>>>>
>>>>> So, I think this may be the final solution. (BTW, I have gone over
>>>>> the rdma drivers, EFA does in this way, it also has two separated
>>>>> devices for net and rdma. It inspired me).
>>>>
>>>> I'm not sure this works for an iWarp device - various things expect to
>>>> know the netdevice to know how to relate IP addresses to the iWarp
>>>> stuff - but then I don't really know iWarp.
>>>
>>> As far as I know, iWarp device only has one GID entry which generated
>>> from MAC address.
>>>
>>> For iWarp, The CM part in core code resolves address, finds
>>> route with the help of kernel's net subsystem, and then obtains the
>> correct
>>> ibdev by GID matching. The GID matching in iWarp is indeed MAC address
>>> matching.
>>>
>>> In another words, for iWarp devices, the core code doesn't handle IP
>>> addressing related stuff directly, it is finished by calling net APIs.
>>> The netdev set by ib_device_set_netdev does not used in iWarp's CM
>>> process.
>>>
>>> The binded netdev in iWarp devices, mainly have two purposes:
>>>    1). generated GID0, using the netdev's mac address.
>>>    2). get the port state and attributes.
>>>
>>> For 1), erdma device binded to net device also by mac address, which can
>>> be obtained from our PCIe bar registers.
>>> For 2), erdma can also get the information, and may be more accurately.
>>> For example, erdma can have different MTU with virtio-net in our cloud.
>>>
>>> For RoCEv2, I know that it has many GIDs, some of them are generated
>>> from IP addresses, and handing IP addressing in core code.
>>
>> Bernard, Tom what do you think?
>>
>> Jason
> 
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, more natural would be to implement IP based RDMA protocols
> connection management by relying on IP addresses.

Agreed. Exposing MAC addresses for this seems so... 20th century.

Tom.

> Sorry for asking again - why erdma does not need to link with netdev?
> Can erdma exist without using a netdev?
> 
> 
> Thanks,
> Bernard
Cheng Xu May 20, 2022, 7:03 a.m. UTC | #17
On 5/20/22 12:20 AM, Bernard Metzler wrote:
> 
> 

<...>

>>> As far as I know, iWarp device only has one GID entry which generated
>>> from MAC address.
>>>
>>> For iWarp, The CM part in core code resolves address, finds
>>> route with the help of kernel's net subsystem, and then obtains the
>> correct
>>> ibdev by GID matching. The GID matching in iWarp is indeed MAC address
>>> matching.
>>>
>>> In another words, for iWarp devices, the core code doesn't handle IP
>>> addressing related stuff directly, it is finished by calling net APIs.
>>> The netdev set by ib_device_set_netdev does not used in iWarp's CM
>>> process.
>>>
>>> The binded netdev in iWarp devices, mainly have two purposes:
>>>    1). generated GID0, using the netdev's mac address.
>>>    2). get the port state and attributes.
>>>
>>> For 1), erdma device binded to net device also by mac address, which can
>>> be obtained from our PCIe bar registers.
>>> For 2), erdma can also get the information, and may be more accurately.
>>> For example, erdma can have different MTU with virtio-net in our cloud.
>>>
>>> For RoCEv2, I know that it has many GIDs, some of them are generated
>>> from IP addresses, and handing IP addressing in core code.
>>
>> Bernard, Tom what do you think?
>>
>> Jason
> 
> I think iWarp (and now RoCEv2 with its UDP dependency) drivers
> produce GIDs mostly to satisfy the current RDMA CM infrastructure,
> which depends on this type of unique identifier, inherited from IB.
> Imo, more natural would be to implement IP based RDMA protocols
> connection management by relying on IP addresses.
> 
> Sorry for asking again - why erdma does not need to link with netdev?
> Can erdma exist without using a netdev?

Actually, erdma also needs a net device to be bound to, and it has one.

These days I'm trying to find an acceptable way to get a reference to
the bound netdev, e.g. the 'struct net_device' pointer. Unlike other
RDMA drivers, which can get a reference to their bound netdevs easily
(most RDMA devices are based on extended aux devices), it is a little
more complex for erdma, because erdma and its bound net device are two
separate PCIe devices.

Then I found that the netdev reference held in the ibdev is rarely used
in the core code for iWarp devices; GID0 is the key attribute (as you
and Tom mentioned, it exists for historical compatibility reasons, but
I think that is another story).

So, there are two choices for erdma: enumerate the net devices and find
the matching one, or never call ib_device_set_netdev. The second one
needs less code.

The second way can't work for RoCE, but it works for iWarp (I've tested
it), since the netdev reference is rarely used for iWarp in the core
code, as I said in my last reply.

In short, the question discussed here is: is it acceptable for an iWarp
driver (which does have a netdev bound to it) not to hold the netdev
reference in the core code? Or is it necessary to call
ib_device_set_netdev to set the bound netdev for an iWarp driver?

You and Tom are both specialists in iWarp; your opinions are important.

Thanks very much
Cheng Xu


> 
> Thanks,
> Bernard.

Patch

diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c
new file mode 100644
index 000000000000..86c6f574fc68
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_main.c
@@ -0,0 +1,625 @@ 
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+
+/* Authors: Cheng Xu <chengyou@linux.alibaba.com> */
+/*          Kai Shen <kaishen@linux.alibaba.com> */
+/* Copyright (c) 2020-2022, Alibaba Group. */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <net/addrconf.h>
+#include <rdma/erdma-abi.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+MODULE_AUTHOR("Cheng Xu <chengyou@linux.alibaba.com>");
+MODULE_DESCRIPTION("Alibaba elasticRDMA adapter driver");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static int erdma_device_register(struct erdma_dev *dev)
+{
+	struct ib_device *ibdev = &dev->ibdev;
+	int ret;
+
+	memset(ibdev->name, 0, IB_DEVICE_NAME_MAX);
+	/*
+	 * In the Alibaba ECS environment, an ENI's MAC address is unique
+	 * within a VPC. So we generate the ibdev's name from the MAC
+	 * address of the bound netdev.
+	 */
+	ret = snprintf(ibdev->name, IB_DEVICE_NAME_MAX, "%s_%.2x%.2x%.2x",
+		       DRV_MODULE_NAME, dev->attrs.peer_addr[3],
+		       dev->attrs.peer_addr[4], dev->attrs.peer_addr[5]);
+	if (ret < 0)
+		return ret;
+
+	ibdev->phys_port_cnt = 1;
+	ret = ib_device_set_netdev(ibdev, NULL, 1);
+	if (ret)
+		return ret;
+
+	ret = ib_register_device(ibdev, ibdev->name, &dev->pdev->dev);
+	if (ret)
+		dev_err(&dev->pdev->dev,
+			"ib_register_device(%s) failed: ret = %d\n",
+			ibdev->name, ret);
+
+	return ret;
+}
+
+static int erdma_netdev_event(struct notifier_block *nb, unsigned long event,
+			      void *arg)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(arg);
+	struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb);
+
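+	/* Ignore events for any netdev other than the one we are bound to. */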
+	if (dev->netdev == NULL || dev->netdev != netdev)
+		goto done;
+
+	switch (event) {
+	case NETDEV_UP:
+		dev->state = IB_PORT_ACTIVE;
+		erdma_port_event(dev, IB_EVENT_PORT_ACTIVE);
+		break;
+	case NETDEV_DOWN:
+		dev->state = IB_PORT_DOWN;
+		erdma_port_event(dev, IB_EVENT_PORT_ERR);
+		break;
+	case NETDEV_REGISTER:
+	case NETDEV_UNREGISTER:
+	case NETDEV_CHANGEADDR:
+	case NETDEV_CHANGEMTU:
+	case NETDEV_GOING_DOWN:
+	case NETDEV_CHANGE:
+	default:
+		break;
+	}
+
+done:
+	return NOTIFY_OK;
+}
+
+static int erdma_newlink(const char *dev_name, struct net_device *netdev)
+{
+	struct ib_device *ibdev;
+	struct erdma_dev *dev;
+	int ret;
+
+	ibdev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_ERDMA);
+	if (ibdev) {
+		ib_device_put(ibdev);
+		return -EEXIST;
+	}
+
+	ibdev = ib_device_get_by_name(dev_name, RDMA_DRIVER_ERDMA);
+	if (!ibdev)
+		return -ENODEV;
+
+	dev = to_edev(ibdev);
+
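+	/* Bind only the netdev whose permanent MAC matches the device MAC. */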
+	if (!ether_addr_equal_unaligned(netdev->perm_addr,
+					dev->attrs.peer_addr)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	addrconf_addr_eui48((u8 *)&dev->ibdev.node_guid, netdev->dev_addr);
+	ret = ib_device_set_netdev(&dev->ibdev, netdev, 1);
+	if (ret)
+		goto out;
+
+	if (netif_running(netdev) && netif_carrier_ok(netdev))
+		dev->state = IB_PORT_ACTIVE;
+	else
+		dev->state = IB_PORT_DOWN;
+
+	dev->netdev_nb.notifier_call = erdma_netdev_event;
+	dev->netdev = netdev;
+
+	ret = register_netdevice_notifier(&dev->netdev_nb);
+out:
+	ib_device_put(ibdev);
+	return ret;
+}
+
+static struct rdma_link_ops erdma_link_ops = {
+	.type = "erdma",
+	.newlink = erdma_newlink,
+};
+
+static irqreturn_t erdma_comm_irq_handler(int irq, void *data)
+{
+	struct erdma_dev *dev = data;
+
+	erdma_cmdq_completion_handler(&dev->cmdq);
+	erdma_aeq_event_handler(dev);
+
+	return IRQ_HANDLED;
+}
+
+static void erdma_dwqe_resource_init(struct erdma_dev *dev)
+{
+	int total_pages, type0, type1;
+
+	dev->attrs.grp_num = erdma_reg_read32(dev, ERDMA_REGS_GRP_NUM_REG);
+
+	if (dev->attrs.grp_num < 4)
+		dev->attrs.disable_dwqe = true;
+	else
+		dev->attrs.disable_dwqe = false;
+
+	/* One page contains 4 groups. */
+	total_pages = dev->attrs.grp_num * 4;
+
+	if (dev->attrs.grp_num >= ERDMA_DWQE_MAX_GRP_CNT) {
+		dev->attrs.grp_num = ERDMA_DWQE_MAX_GRP_CNT;
+		type0 = ERDMA_DWQE_TYPE0_CNT;
+		type1 = ERDMA_DWQE_TYPE1_CNT / ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+	} else {
+		type1 = total_pages / 3;
+		type0 = total_pages - type1 - 1;
+	}
+
+	dev->attrs.dwqe_pages = type0;
+	dev->attrs.dwqe_entries = type1 * ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+}
+
+static int erdma_request_vectors(struct erdma_dev *dev)
+{
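+	/* One vector serves the cmdq/AEQ; the rest serve completion EQs. */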
+	int expect_irq_num = min(num_possible_cpus() + 1, ERDMA_NUM_MSIX_VEC);
+
+	dev->attrs.irq_num = pci_alloc_irq_vectors(dev->pdev, 1, expect_irq_num,
+						   PCI_IRQ_MSIX);
+	if (dev->attrs.irq_num <= 0) {
+		dev_err(&dev->pdev->dev, "request irq vectors failed(%d)\n",
+			dev->attrs.irq_num);
+		return -ENOSPC;
+	}
+
+	return 0;
+}
+
+static int erdma_comm_irq_init(struct erdma_dev *dev)
+{
+	snprintf(dev->comm_irq.name, ERDMA_IRQNAME_SIZE, "erdma-common@pci:%s",
+		 pci_name(dev->pdev));
+	dev->comm_irq.msix_vector =
+		pci_irq_vector(dev->pdev, ERDMA_MSIX_VECTOR_CMDQ);
+
+	cpumask_set_cpu(cpumask_first(cpumask_of_pcibus(dev->pdev->bus)),
+			&dev->comm_irq.affinity_hint_mask);
+	irq_set_affinity_hint(dev->comm_irq.msix_vector,
+			      &dev->comm_irq.affinity_hint_mask);
+
+	return request_irq(dev->comm_irq.msix_vector, erdma_comm_irq_handler, 0,
+			   dev->comm_irq.name, dev);
+}
+
+static void erdma_comm_irq_uninit(struct erdma_dev *dev)
+{
+	irq_set_affinity_hint(dev->comm_irq.msix_vector, NULL);
+	free_irq(dev->comm_irq.msix_vector, dev);
+}
+
+static int erdma_device_init(struct erdma_dev *dev, struct pci_dev *pdev)
+{
+	erdma_dwqe_resource_init(dev);
+
+	return dma_set_mask_and_coherent(&pdev->dev,
+					 DMA_BIT_MASK(ERDMA_PCI_WIDTH));
+}
+
+static void erdma_device_uninit(struct erdma_dev *dev)
+{
+	u32 ctrl = FIELD_PREP(ERDMA_REG_DEV_CTRL_RESET_MASK, 1);
+
+	erdma_reg_write32(dev, ERDMA_REGS_DEV_CTRL_REG, ctrl);
+}
+
+static const struct pci_device_id erdma_pci_tbl[] = {
+	{ PCI_DEVICE(PCI_VENDOR_ID_ALIBABA, 0x107f) },
+	{}
+};
+
+static int erdma_probe_dev(struct pci_dev *pdev)
+{
+	int err;
+	struct erdma_dev *dev;
+	u32 version;
+	int bars;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "pci_enable_device failed(%d)\n", err);
+		return err;
+	}
+
+	pci_set_master(pdev);
+
+	dev = ib_alloc_device(erdma_dev, ibdev);
+	if (!dev) {
+		dev_err(&pdev->dev, "ib_alloc_device failed\n");
+		err = -ENOMEM;
+		goto err_disable_device;
+	}
+
+	pci_set_drvdata(pdev, dev);
+	dev->pdev = pdev;
+	dev->attrs.numa_node = dev_to_node(&pdev->dev);
+
+	bars = pci_select_bars(pdev, IORESOURCE_MEM);
+	err = pci_request_selected_regions(pdev, bars, DRV_MODULE_NAME);
+	if (bars != ERDMA_BAR_MASK || err) {
+		err = err ? err : -EINVAL;
+		goto err_ib_device_release;
+	}
+
+	dev->func_bar_addr = pci_resource_start(pdev, ERDMA_FUNC_BAR);
+	dev->func_bar_len = pci_resource_len(pdev, ERDMA_FUNC_BAR);
+
+	dev->func_bar =
+		devm_ioremap(&pdev->dev, dev->func_bar_addr, dev->func_bar_len);
+	if (!dev->func_bar) {
+		dev_err(&pdev->dev, "devm_ioremap failed.\n");
+		err = -EFAULT;
+		goto err_release_bars;
+	}
+
+	version = erdma_reg_read32(dev, ERDMA_REGS_VERSION_REG);
+	if (version == 0) {
+		/* A version of zero means this PCI function is non-functional. */
+		err = -ENODEV;
+		goto err_iounmap_func_bar;
+	}
+
+	err = erdma_device_init(dev, pdev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_request_vectors(dev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_comm_irq_init(dev);
+	if (err)
+		goto err_free_vectors;
+
+	err = erdma_aeq_init(dev);
+	if (err)
+		goto err_uninit_comm_irq;
+
+	err = erdma_cmdq_init(dev);
+	if (err)
+		goto err_uninit_aeq;
+
+	err = erdma_ceqs_init(dev);
+	if (err)
+		goto err_uninit_cmdq;
+
+	erdma_finish_cmdq_init(dev);
+
+	return 0;
+
+err_uninit_cmdq:
+	erdma_device_uninit(dev);
+	erdma_cmdq_destroy(dev);
+
+err_uninit_aeq:
+	erdma_aeq_destroy(dev);
+
+err_uninit_comm_irq:
+	erdma_comm_irq_uninit(dev);
+
+err_free_vectors:
+	pci_free_irq_vectors(dev->pdev);
+
+err_iounmap_func_bar:
+	devm_iounmap(&pdev->dev, dev->func_bar);
+
+err_release_bars:
+	pci_release_selected_regions(pdev, bars);
+
+err_ib_device_release:
+	ib_dealloc_device(&dev->ibdev);
+
+err_disable_device:
+	pci_disable_device(pdev);
+
+	return err;
+}
+
+static void erdma_remove_dev(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	erdma_ceqs_uninit(dev);
+
+	erdma_device_uninit(dev);
+
+	erdma_cmdq_destroy(dev);
+	erdma_aeq_destroy(dev);
+	erdma_comm_irq_uninit(dev);
+	pci_free_irq_vectors(dev->pdev);
+
+	devm_iounmap(&pdev->dev, dev->func_bar);
+	pci_release_selected_regions(pdev, ERDMA_BAR_MASK);
+
+	ib_dealloc_device(&dev->ibdev);
+
+	pci_disable_device(pdev);
+}
+
+#define ERDMA_GET_CAP(name, cap) FIELD_GET(ERDMA_CMD_DEV_CAP_##name##_MASK, cap)
+
+static int erdma_dev_attrs_init(struct erdma_dev *dev)
+{
+	int err;
+	u64 req_hdr, cap0, cap1;
+
+	erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_RDMA,
+				CMDQ_OPCODE_QUERY_DEVICE);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0,
+				  &cap1);
+	if (err)
+		return err;
+
+	dev->attrs.max_cqe = 1 << ERDMA_GET_CAP(MAX_CQE, cap0);
+	dev->attrs.max_mr_size = 1ULL << ERDMA_GET_CAP(MAX_MR_SIZE, cap0);
+	dev->attrs.max_mw = 1 << ERDMA_GET_CAP(MAX_MW, cap1);
+	dev->attrs.max_recv_wr = 1 << ERDMA_GET_CAP(MAX_RECV_WR, cap0);
+	dev->attrs.local_dma_key = ERDMA_GET_CAP(DMA_LOCAL_KEY, cap1);
+	dev->attrs.cc = ERDMA_GET_CAP(DEFAULT_CC, cap1);
+	dev->attrs.max_qp = ERDMA_NQP_PER_QBLOCK * ERDMA_GET_CAP(QBLOCK, cap1);
+	dev->attrs.max_mr = dev->attrs.max_qp << 1;
+	dev->attrs.max_cq = dev->attrs.max_qp << 1;
+
+	dev->attrs.max_send_wr = ERDMA_MAX_SEND_WR;
+	dev->attrs.max_ord = ERDMA_MAX_ORD;
+	dev->attrs.max_ird = ERDMA_MAX_IRD;
+	dev->attrs.max_send_sge = ERDMA_MAX_SEND_SGE;
+	dev->attrs.max_recv_sge = ERDMA_MAX_RECV_SGE;
+	dev->attrs.max_sge_rd = ERDMA_MAX_SGE_RD;
+	dev->attrs.max_pd = ERDMA_MAX_PD;
+
+	dev->res_cb[ERDMA_RES_TYPE_PD].max_cap = ERDMA_MAX_PD;
+	dev->res_cb[ERDMA_RES_TYPE_STAG_IDX].max_cap = dev->attrs.max_mr;
+
+	erdma_cmdq_build_reqhdr(&req_hdr, CMDQ_SUBMOD_COMMON,
+				CMDQ_OPCODE_QUERY_FW_INFO);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0,
+				  &cap1);
+
+	dev->attrs.fw_version = FIELD_GET(ERDMA_CMD_INFO0_FW_VER_MASK, cap0);
+
+	return err;
+}
+
+static int erdma_res_cb_init(struct erdma_dev *dev)
+{
+	int i, j;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++) {
+		dev->res_cb[i].next_alloc_idx = 1;
+		spin_lock_init(&dev->res_cb[i].lock);
+		dev->res_cb[i].bitmap =
+			kcalloc(BITS_TO_LONGS(dev->res_cb[i].max_cap),
+				sizeof(unsigned long), GFP_KERNEL);
+		/* We will free the memory in erdma_res_cb_free */
+		if (!dev->res_cb[i].bitmap)
+			goto err;
+	}
+
+	return 0;
+
+err:
+	for (j = 0; j < i; j++)
+		kfree(dev->res_cb[j].bitmap);
+
+	return -ENOMEM;
+}
+
+static void erdma_res_cb_free(struct erdma_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++)
+		kfree(dev->res_cb[i].bitmap);
+}
+
+static const struct ib_device_ops erdma_device_ops = {
+	.owner = THIS_MODULE,
+	.driver_id = RDMA_DRIVER_ERDMA,
+	.uverbs_abi_ver = ERDMA_ABI_VERSION,
+
+	.alloc_mr = erdma_ib_alloc_mr,
+	.alloc_pd = erdma_alloc_pd,
+	.alloc_ucontext = erdma_alloc_ucontext,
+	.create_cq = erdma_create_cq,
+	.create_qp = erdma_create_qp,
+	.dealloc_pd = erdma_dealloc_pd,
+	.dealloc_ucontext = erdma_dealloc_ucontext,
+	.dereg_mr = erdma_dereg_mr,
+	.destroy_cq = erdma_destroy_cq,
+	.destroy_qp = erdma_destroy_qp,
+	.get_dma_mr = erdma_get_dma_mr,
+	.get_port_immutable = erdma_get_port_immutable,
+	.iw_accept = erdma_accept,
+	.iw_add_ref = erdma_qp_get_ref,
+	.iw_connect = erdma_connect,
+	.iw_create_listen = erdma_create_listen,
+	.iw_destroy_listen = erdma_destroy_listen,
+	.iw_get_qp = erdma_get_ibqp,
+	.iw_reject = erdma_reject,
+	.iw_rem_ref = erdma_qp_put_ref,
+	.map_mr_sg = erdma_map_mr_sg,
+	.mmap = erdma_mmap,
+	.mmap_free = erdma_mmap_free,
+	.modify_qp = erdma_modify_qp,
+	.post_recv = erdma_post_recv,
+	.post_send = erdma_post_send,
+	.poll_cq = erdma_poll_cq,
+	.query_device = erdma_query_device,
+	.query_gid = erdma_query_gid,
+	.query_port = erdma_query_port,
+	.query_qp = erdma_query_qp,
+	.req_notify_cq = erdma_req_notify_cq,
+	.reg_user_mr = erdma_reg_user_mr,
+
+	INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq),
+	INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd),
+	INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext),
+	INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp),
+};
+
+static int erdma_ib_device_add(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+	struct ib_device *ibdev = &dev->ibdev;
+	u64 mac;
+	int ret = 0;
+
+	ret = erdma_dev_attrs_init(dev);
+	if (ret)
+		return ret;
+
+	ibdev->node_type = RDMA_NODE_RNIC;
+	memcpy(ibdev->node_desc, ERDMA_NODE_DESC, sizeof(ERDMA_NODE_DESC));
+
+	/*
+	 * Current model (one-to-one device association):
+	 * One ERDMA device per net_device or, equivalently,
+	 * per physical port.
+	 */
+	ibdev->phys_port_cnt = 1;
+	ibdev->num_comp_vectors = dev->attrs.irq_num - 1;
+
+	ib_set_device_ops(ibdev, &erdma_device_ops);
+
+	INIT_LIST_HEAD(&dev->cep_list);
+
+	spin_lock_init(&dev->lock);
+	xa_init_flags(&dev->qp_xa, XA_FLAGS_ALLOC1);
+	xa_init_flags(&dev->cq_xa, XA_FLAGS_ALLOC1);
+	dev->next_alloc_cqn = 1;
+	dev->next_alloc_qpn = 1;
+
+	ret = erdma_res_cb_init(dev);
+	if (ret)
+		return ret;
+
+	spin_lock_init(&dev->db_bitmap_lock);
+	bitmap_zero(dev->sdb_page, ERDMA_DWQE_TYPE0_CNT);
+	bitmap_zero(dev->sdb_entry, ERDMA_DWQE_TYPE1_CNT);
+
+	atomic_set(&dev->num_ctx, 0);
+
+	mac = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG);
+	mac |= (u64)erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG) << 32;
+
+	u64_to_ether_addr(mac, dev->attrs.peer_addr);
+
+	ret = erdma_device_register(dev);
+	if (ret)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+
+	erdma_res_cb_free(dev);
+
+	return ret;
+}
+
+static void erdma_ib_device_remove(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	if (dev->netdev) {
+		ib_device_set_netdev(&dev->ibdev, NULL, 1);
+		dev->netdev = NULL;
+		unregister_netdevice_notifier(&dev->netdev_nb);
+	}
+
+	ib_unregister_device(&dev->ibdev);
+
+	erdma_res_cb_free(dev);
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+}
+
+static int erdma_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	int ret;
+
+	ret = erdma_probe_dev(pdev);
+	if (ret)
+		return ret;
+
+	ret = erdma_ib_device_add(pdev);
+	if (ret) {
+		erdma_remove_dev(pdev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void erdma_remove(struct pci_dev *pdev)
+{
+	erdma_ib_device_remove(pdev);
+	erdma_remove_dev(pdev);
+}
+
+static struct pci_driver erdma_pci_driver = {
+	.name = DRV_MODULE_NAME,
+	.id_table = erdma_pci_tbl,
+	.probe = erdma_probe,
+	.remove = erdma_remove
+};
+
+MODULE_DEVICE_TABLE(pci, erdma_pci_tbl);
+
+static __init int erdma_init_module(void)
+{
+	int ret;
+
+	ret = erdma_cm_init();
+	if (ret)
+		return ret;
+
+	ret = pci_register_driver(&erdma_pci_driver);
+	if (ret) {
+		erdma_cm_exit();
+		return ret;
+	}
+
+	rdma_link_register(&erdma_link_ops);
+
+	return 0;
+}
+
+static void __exit erdma_exit_module(void)
+{
+	rdma_link_unregister(&erdma_link_ops);
+
+	pci_unregister_driver(&erdma_pci_driver);
+
+	erdma_cm_exit();
+}
+
+module_init(erdma_init_module);
+module_exit(erdma_exit_module);