diff mbox series

[V3,net-next] net: mana: Add page pool for RX buffers

Message ID 1689966321-17337-1-git-send-email-haiyangz@microsoft.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [V3,net-next] net: mana: Add page pool for RX buffers | expand

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1342 this patch: 1342
netdev/cc_maintainers success CCed 19 of 19 maintainers
netdev/build_clang success Errors and warnings before: 1365 this patch: 1365
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1365 this patch: 1365
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 240 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Haiyang Zhang July 21, 2023, 7:05 p.m. UTC
Add page pool for RX buffers for faster buffer cycle and reduce CPU
usage.

The standard page pool API is used.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
---
V3:
Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
V2:
Use the standard page pool API as suggested by Jesper Dangaard Brouer

---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
 include/net/mana/mana.h                       |  3 +
 2 files changed, 78 insertions(+), 16 deletions(-)

Comments

Jesper Dangaard Brouer July 24, 2023, 11:29 a.m. UTC | #1
On 21/07/2023 21.05, Haiyang Zhang wrote:
> Add page pool for RX buffers for faster buffer cycle and reduce CPU
> usage.
> 
> The standard page pool API is used.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
> V3:
> Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
> V2:
> Use the standard page pool API as suggested by Jesper Dangaard Brouer
> 
> ---
>   drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
>   include/net/mana/mana.h                       |  3 +
>   2 files changed, 78 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index a499e460594b..4307f25f8c7a 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
[...]
> @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>   
>   	if (rxq->xdp_flush)
>   		xdp_do_flush();
> +
> +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());

I don't think this page_pool_nid_changed() called is needed, if you do
as I suggest below (nid = NUMA_NO_NODE).


>   }
>   
>   static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
[...]

> @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
>   	return 0;
>   }
>   
> +static int mana_create_page_pool(struct mana_rxq *rxq)
> +{
> +	struct page_pool_params pprm = {};

You are implicitly assigning NUMA node id zero.

> +	int ret;
> +
> +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
> +	pprm.napi = &rxq->rx_cq.napi;

You likely want to assign pprm.nid to NUMA_NO_NODE

  pprm.nid = NUMA_NO_NODE;

For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled
this setting will automatically select the (preferred) NUMA node (via
``numa_mem_id()``) based on where NAPI RX-processing is currently
running. The effect is that page_pool will only use recycled memory when
NUMA node match running CPU. This assumes CPU refilling driver RX-ring
will also run RX-NAPI.

If a driver want more control over the NUMA node memory selection,
drivers can assign (``pp_params.nid``) something else than
`NUMA_NO_NODE`` and runtime adjust via function ``page_pool_nid_changed()``.

I will update [1] with this info.
  - Docs [1] 
https://kernel.org/doc/html/latest/networking/page_pool.html#registration


> +
> +	rxq->page_pool = page_pool_create(&pprm);
> +
> +	if (IS_ERR(rxq->page_pool)) {
> +		ret = PTR_ERR(rxq->page_pool);
> +		rxq->page_pool = NULL;
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
Haiyang Zhang July 24, 2023, 3:46 p.m. UTC | #2
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Monday, July 24, 2023 7:29 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org
> Cc: brouer@redhat.com; Dexuan Cui <decui@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>;
> olaf@aepfle.de; vkuznets@redhat.com; davem@davemloft.net;
> wei.liu@kernel.org; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; leon@kernel.org; Long Li <longli@microsoft.com>;
> ssengar@linux.microsoft.com; linux-rdma@vger.kernel.org;
> daniel@iogearbox.net; john.fastabend@gmail.com; bpf@vger.kernel.org;
> ast@kernel.org; Ajay Sharma <sharmaajay@microsoft.com>; hawk@kernel.org;
> tglx@linutronix.de; shradhagupta@linux.microsoft.com; linux-
> kernel@vger.kernel.org; Ilias Apalodimas <ilias.apalodimas@linaro.org>; Jesper
> Dangaard Brouer <hawk@kernel.org>
> Subject: Re: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
> 
> 
> 
> On 21/07/2023 21.05, Haiyang Zhang wrote:
> > Add page pool for RX buffers for faster buffer cycle and reduce CPU
> > usage.
> >
> > The standard page pool API is used.
> >
> > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > ---
> > V3:
> > Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
> > V2:
> > Use the standard page pool API as suggested by Jesper Dangaard Brouer
> >
> > ---
> >   drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++----
> >   include/net/mana/mana.h                       |  3 +
> >   2 files changed, 78 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index a499e460594b..4307f25f8c7a 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> [...]
> > @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
> >
> >   	if (rxq->xdp_flush)
> >   		xdp_do_flush();
> > +
> > +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
> 
> I don't think this page_pool_nid_changed() called is needed, if you do
> as I suggest below (nid = NUMA_NO_NODE).
> 
> 
> >   }
> >
> >   static int mana_cq_handler(void *context, struct gdma_queue
> *gdma_queue)
> [...]
> 
> > @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq *rxq)
> >   	return 0;
> >   }
> >
> > +static int mana_create_page_pool(struct mana_rxq *rxq)
> > +{
> > +	struct page_pool_params pprm = {};
> 
> You are implicitly assigning NUMA node id zero.
> 
> > +	int ret;
> > +
> > +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
> > +	pprm.napi = &rxq->rx_cq.napi;
> 
> You likely want to assign pprm.nid to NUMA_NO_NODE
> 
>   pprm.nid = NUMA_NO_NODE;
> 
> For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
> as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled
> this setting will automatically select the (preferred) NUMA node (via
> ``numa_mem_id()``) based on where NAPI RX-processing is currently
> running. The effect is that page_pool will only use recycled memory when
> NUMA node match running CPU. This assumes CPU refilling driver RX-ring
> will also run RX-NAPI.
> 
> If a driver want more control over the NUMA node memory selection,
> drivers can assign (``pp_params.nid``) something else than
> `NUMA_NO_NODE`` and runtime adjust via function
> ``page_pool_nid_changed()``.

Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id 
to zero during pool init. 

And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
will update the nid for the pool. Does this sound good?

Thanks,
-Haiyang
Haiyang Zhang July 24, 2023, 6:35 p.m. UTC | #3
> -----Original Message-----
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Sent: Monday, July 24, 2023 11:46 AM
> To: Jesper Dangaard Brouer <jbrouer@redhat.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org
> Cc: brouer@redhat.com; Dexuan Cui <decui@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>;
> olaf@aepfle.de; vkuznets@redhat.com; davem@davemloft.net;
> wei.liu@kernel.org; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; leon@kernel.org; Long Li <longli@microsoft.com>;
> ssengar@linux.microsoft.com; linux-rdma@vger.kernel.org;
> daniel@iogearbox.net; john.fastabend@gmail.com; bpf@vger.kernel.org;
> ast@kernel.org; Ajay Sharma <sharmaajay@microsoft.com>; hawk@kernel.org;
> tglx@linutronix.de; shradhagupta@linux.microsoft.com; linux-
> kernel@vger.kernel.org; Ilias Apalodimas <ilias.apalodimas@linaro.org>; Jesper
> Dangaard Brouer <hawk@kernel.org>
> Subject: RE: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
> 
> 
> 
> > -----Original Message-----
> > From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> > Sent: Monday, July 24, 2023 7:29 AM
> > To: Haiyang Zhang <haiyangz@microsoft.com>; linux-hyperv@vger.kernel.org;
> > netdev@vger.kernel.org
> > Cc: brouer@redhat.com; Dexuan Cui <decui@microsoft.com>; KY Srinivasan
> > <kys@microsoft.com>; Paul Rosswurm <paulros@microsoft.com>;
> > olaf@aepfle.de; vkuznets@redhat.com; davem@davemloft.net;
> > wei.liu@kernel.org; edumazet@google.com; kuba@kernel.org;
> > pabeni@redhat.com; leon@kernel.org; Long Li <longli@microsoft.com>;
> > ssengar@linux.microsoft.com; linux-rdma@vger.kernel.org;
> > daniel@iogearbox.net; john.fastabend@gmail.com; bpf@vger.kernel.org;
> > ast@kernel.org; Ajay Sharma <sharmaajay@microsoft.com>;
> hawk@kernel.org;
> > tglx@linutronix.de; shradhagupta@linux.microsoft.com; linux-
> > kernel@vger.kernel.org; Ilias Apalodimas <ilias.apalodimas@linaro.org>;
> Jesper
> > Dangaard Brouer <hawk@kernel.org>
> > Subject: Re: [PATCH V3,net-next] net: mana: Add page pool for RX buffers
> >
> >
> >
> > On 21/07/2023 21.05, Haiyang Zhang wrote:
> > > Add page pool for RX buffers for faster buffer cycle and reduce CPU
> > > usage.
> > >
> > > The standard page pool API is used.
> > >
> > > Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > ---
> > > V3:
> > > Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
> > > V2:
> > > Use the standard page pool API as suggested by Jesper Dangaard Brouer
> > >
> > > ---
> > >   drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++--
> --
> > >   include/net/mana/mana.h                       |  3 +
> > >   2 files changed, 78 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > index a499e460594b..4307f25f8c7a 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > [...]
> > > @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
> > >
> > >   	if (rxq->xdp_flush)
> > >   		xdp_do_flush();
> > > +
> > > +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
> >
> > I don't think this page_pool_nid_changed() called is needed, if you do
> > as I suggest below (nid = NUMA_NO_NODE).
> >
> >
> > >   }
> > >
> > >   static int mana_cq_handler(void *context, struct gdma_queue
> > *gdma_queue)
> > [...]
> >
> > > @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq
> *rxq)
> > >   	return 0;
> > >   }
> > >
> > > +static int mana_create_page_pool(struct mana_rxq *rxq)
> > > +{
> > > +	struct page_pool_params pprm = {};
> >
> > You are implicitly assigning NUMA node id zero.
> >
> > > +	int ret;
> > > +
> > > +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
> > > +	pprm.napi = &rxq->rx_cq.napi;
> >
> > You likely want to assign pprm.nid to NUMA_NO_NODE
> >
> >   pprm.nid = NUMA_NO_NODE;
> >
> > For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
> > as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled
> > this setting will automatically select the (preferred) NUMA node (via
> > ``numa_mem_id()``) based on where NAPI RX-processing is currently
> > running. The effect is that page_pool will only use recycled memory when
> > NUMA node match running CPU. This assumes CPU refilling driver RX-ring
> > will also run RX-NAPI.
> >
> > If a driver want more control over the NUMA node memory selection,
> > drivers can assign (``pp_params.nid``) something else than
> > `NUMA_NO_NODE`` and runtime adjust via function
> > ``page_pool_nid_changed()``.
> 
> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
> to zero during pool init.
> 
> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
> will update the nid for the pool. Does this sound good?
> 

Also, since our driver is getting the default node from here:
	gc->numa_node = dev_to_node(&pdev->dev);
I will update this patch to set the default node as above, instead of implicitly
assigning it to 0.

Thanks,
- Haiyang
Jesper Dangaard Brouer July 25, 2023, 6:01 p.m. UTC | #4
On 24/07/2023 20.35, Haiyang Zhang wrote:
> 
[...]
>>> On 21/07/2023 21.05, Haiyang Zhang wrote:
>>>> Add page pool for RX buffers for faster buffer cycle and reduce CPU
>>>> usage.
>>>>
>>>> The standard page pool API is used.
>>>>
>>>> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
>>>> ---
>>>> V3:
>>>> Update xdp mem model, pool param, alloc as suggested by Jakub Kicinski
>>>> V2:
>>>> Use the standard page pool API as suggested by Jesper Dangaard Brouer
>>>>
>>>> ---
>>>>    drivers/net/ethernet/microsoft/mana/mana_en.c | 91 +++++++++++++++--
>> --
>>>>    include/net/mana/mana.h                       |  3 +
>>>>    2 files changed, 78 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
>>> b/drivers/net/ethernet/microsoft/mana/mana_en.c
>>>> index a499e460594b..4307f25f8c7a 100644
>>>> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
>>>> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>>> [...]
>>>> @@ -1659,6 +1679,8 @@ static void mana_poll_rx_cq(struct mana_cq *cq)
>>>>
>>>>    	if (rxq->xdp_flush)
>>>>    		xdp_do_flush();
>>>> +
>>>> +	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
>>>
>>> I don't think this page_pool_nid_changed() called is needed, if you do
>>> as I suggest below (nid = NUMA_NO_NODE).
>>>
>>>
>>>>    }
>>>>
>>>>    static int mana_cq_handler(void *context, struct gdma_queue
>>> *gdma_queue)
>>> [...]
>>>
>>>> @@ -2008,6 +2041,25 @@ static int mana_push_wqe(struct mana_rxq
>> *rxq)
>>>>    	return 0;
>>>>    }
>>>>
>>>> +static int mana_create_page_pool(struct mana_rxq *rxq)
>>>> +{
>>>> +	struct page_pool_params pprm = {};
>>>
>>> You are implicitly assigning NUMA node id zero.
>>>
>>>> +	int ret;
>>>> +
>>>> +	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
>>>> +	pprm.napi = &rxq->rx_cq.napi;
>>>
>>> You likely want to assign pprm.nid to NUMA_NO_NODE
>>>
>>>    pprm.nid = NUMA_NO_NODE;
>>>
>>> For most drivers it is recommended to assign ``NUMA_NO_NODE`` (value -1)
>>> as the NUMA ID to ``pp_params.nid``. When ``CONFIG_NUMA`` is enabled
>>> this setting will automatically select the (preferred) NUMA node (via
>>> ``numa_mem_id()``) based on where NAPI RX-processing is currently
>>> running. The effect is that page_pool will only use recycled memory when
>>> NUMA node match running CPU. This assumes CPU refilling driver RX-ring
>>> will also run RX-NAPI.
>>>
>>> If a driver want more control over the NUMA node memory selection,
>>> drivers can assign (``pp_params.nid``) something else than
>>> `NUMA_NO_NODE`` and runtime adjust via function
>>> ``page_pool_nid_changed()``.
>>
>> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
>> to zero during pool init.
>>
>> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
>> will update the nid for the pool. Does this sound good?
>>
> 
> Also, since our driver is getting the default node from here:
> 	gc->numa_node = dev_to_node(&pdev->dev);
> I will update this patch to set the default node as above, instead of implicitly
> assigning it to 0.
> 

In that case, I agree that it make sense to use dev_to_node(&pdev->dev), 
like:
	pprm.nid = dev_to_node(&pdev->dev);

Driver must have a reason for assigning gc->numa_node for this hardware,
which is okay. That is why page_pool API allows driver to control this.

But then I don't think you should call page_pool_nid_changed() like

	page_pool_nid_changed(rxq->page_pool, numa_mem_id());

Because then you will (at first packet processing event) revert the
dev_to_node() setting to use numa_mem_id() of processing/running CPU.
(In effect this will be the same as setting NUMA_NO_NODE).

I know, mlx5 do call page_pool_nid_changed(), but they showed benchmark
numbers that this was preferred action, even-when sysadm had
"misconfigured" the default smp_affinity RX-processing to happen on a
remote NUMA node.  AFAIK mlx5 keeps the descriptor rings on the
originally configured NUMA node that corresponds to the NIC PCIe slot.

--Jesper
Haiyang Zhang July 25, 2023, 7:02 p.m. UTC | #5
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Tuesday, July 25, 2023 2:01 PM
> >>
> >> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
> >> to zero during pool init.
> >>
> >> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
> >> will update the nid for the pool. Does this sound good?
> >>
> >
> > Also, since our driver is getting the default node from here:
> > 	gc->numa_node = dev_to_node(&pdev->dev);
> > I will update this patch to set the default node as above, instead of implicitly
> > assigning it to 0.
> >
> 
> In that case, I agree that it make sense to use dev_to_node(&pdev->dev),
> like:
> 	pprm.nid = dev_to_node(&pdev->dev);
> 
> Driver must have a reason for assigning gc->numa_node for this hardware,
> which is okay. That is why page_pool API allows driver to control this.
> 
> But then I don't think you should call page_pool_nid_changed() like
> 
> 	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
> 
> Because then you will (at first packet processing event) revert the
> dev_to_node() setting to use numa_mem_id() of processing/running CPU.
> (In effect this will be the same as setting NUMA_NO_NODE).
> 
> I know, mlx5 do call page_pool_nid_changed(), but they showed benchmark
> numbers that this was preferred action, even-when sysadm had
> "misconfigured" the default smp_affinity RX-processing to happen on a
> remote NUMA node.  AFAIK mlx5 keeps the descriptor rings on the
> originally configured NUMA node that corresponds to the NIC PCIe slot.

In mana_gd_setup_irqs(), we set the default IRQ/CPU affinity to gc->numa_node 
too, so it won't revert the nid initial setting. 

Currently, the Azure hypervisor always indicates numa 0 as default. (In 
the future, it will start to provide the accurate default dev node.) When a 
user manually changes the IRQ/CPU affinity for perf tuning, we want to 
allow page_pool_nid_changed() to update the pool. Is this OK?

Thanks,
- Haiyang
Jesper Dangaard Brouer July 26, 2023, 9:22 a.m. UTC | #6
On 25/07/2023 21.02, Haiyang Zhang wrote:
> 
>> -----Original Message-----
>> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
>> Sent: Tuesday, July 25, 2023 2:01 PM
>>>>
>>>> Our driver is using NUMA 0 by default, so I implicitly assign NUMA node id
>>>> to zero during pool init.
>>>>
>>>> And, if the IRQ/CPU affinity is changed, the page_pool_nid_changed()
>>>> will update the nid for the pool. Does this sound good?
>>>>
>>>
>>> Also, since our driver is getting the default node from here:
>>> 	gc->numa_node = dev_to_node(&pdev->dev);
>>> I will update this patch to set the default node as above, instead of implicitly
>>> assigning it to 0.
>>>
>>
>> In that case, I agree that it make sense to use dev_to_node(&pdev->dev),
>> like:
>> 	pprm.nid = dev_to_node(&pdev->dev);
>>
>> Driver must have a reason for assigning gc->numa_node for this hardware,
>> which is okay. That is why page_pool API allows driver to control this.
>>
>> But then I don't think you should call page_pool_nid_changed() like
>>
>> 	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
>>
>> Because then you will (at first packet processing event) revert the
>> dev_to_node() setting to use numa_mem_id() of processing/running CPU.
>> (In effect this will be the same as setting NUMA_NO_NODE).
>>
>> I know, mlx5 do call page_pool_nid_changed(), but they showed benchmark
>> numbers that this was preferred action, even-when sysadm had
>> "misconfigured" the default smp_affinity RX-processing to happen on a
>> remote NUMA node.  AFAIK mlx5 keeps the descriptor rings on the
>> originally configured NUMA node that corresponds to the NIC PCIe slot.
> 
> In mana_gd_setup_irqs(), we set the default IRQ/CPU affinity to gc->numa_node
> too, so it won't revert the nid initial setting.
> 
> Currently, the Azure hypervisor always indicates numa 0 as default. (In
> the future, it will start to provide the accurate default dev node.) When a
> user manually changes the IRQ/CPU affinity for perf tuning, we want to
> allow page_pool_nid_changed() to update the pool. Is this OK?
> 

If I were you, I would wait with the page_pool_nid_changed()
"optimization" and do a benchmark mark to see if this actually have a
benefit.  (You can do this in another patch).  (In a Azure hypervisor
environment is might not be the right choice).

This reminds me, do you have any benchmark data on the improvement this
patch (using page_pool) gave?

--Jesper
Haiyang Zhang July 26, 2023, 3:51 p.m. UTC | #7
> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Wednesday, July 26, 2023 5:23 AM
> >
> > In mana_gd_setup_irqs(), we set the default IRQ/CPU affinity to gc-
> >numa_node
> > too, so it won't revert the nid initial setting.
> >
> > Currently, the Azure hypervisor always indicates numa 0 as default. (In
> > the future, it will start to provide the accurate default dev node.) When a
> > user manually changes the IRQ/CPU affinity for perf tuning, we want to
> > allow page_pool_nid_changed() to update the pool. Is this OK?
> >
> 
> If I were you, I would wait with the page_pool_nid_changed()
> "optimization" and do a benchmark mark to see if this actually have a
> benefit.  (You can do this in another patch).  (In a Azure hypervisor
> environment is might not be the right choice).
Ok, I will submit a patch without the page_pool_nid_changed() optimization 
for now, and will do more testing on this.

> This reminds me, do you have any benchmark data on the improvement this
> patch (using page_pool) gave?
With iperf and 128 threads test, this patch improved the throughput by 12-15%, 
and decreased the IRQ associated CPU's usage from 99-100% to 10-50%.

Thanks,
- Haiyang
diff mbox series

Patch

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index a499e460594b..4307f25f8c7a 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1414,8 +1414,8 @@  static struct sk_buff *mana_build_skb(struct mana_rxq *rxq, void *buf_va,
 	return skb;
 }
 
-static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
-			struct mana_rxq *rxq)
+static void mana_rx_skb(void *buf_va, bool from_pool,
+			struct mana_rxcomp_oob *cqe, struct mana_rxq *rxq)
 {
 	struct mana_stats_rx *rx_stats = &rxq->stats;
 	struct net_device *ndev = rxq->ndev;
@@ -1448,6 +1448,9 @@  static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	if (!skb)
 		goto drop;
 
+	if (from_pool)
+		skb_mark_for_recycle(skb);
+
 	skb->dev = napi->dev;
 
 	skb->protocol = eth_type_trans(skb, ndev);
@@ -1498,9 +1501,14 @@  static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 	u64_stats_update_end(&rx_stats->syncp);
 
 drop:
-	WARN_ON_ONCE(rxq->xdp_save_va);
-	/* Save for reuse */
-	rxq->xdp_save_va = buf_va;
+	if (from_pool) {
+		page_pool_recycle_direct(rxq->page_pool,
+					 virt_to_head_page(buf_va));
+	} else {
+		WARN_ON_ONCE(rxq->xdp_save_va);
+		/* Save for reuse */
+		rxq->xdp_save_va = buf_va;
+	}
 
 	++ndev->stats.rx_dropped;
 
@@ -1508,11 +1516,13 @@  static void mana_rx_skb(void *buf_va, struct mana_rxcomp_oob *cqe,
 }
 
 static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
-			     dma_addr_t *da, bool is_napi)
+			     dma_addr_t *da, bool *from_pool, bool is_napi)
 {
 	struct page *page;
 	void *va;
 
+	*from_pool = false;
+
 	/* Reuse XDP dropped page if available */
 	if (rxq->xdp_save_va) {
 		va = rxq->xdp_save_va;
@@ -1533,17 +1543,22 @@  static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 			return NULL;
 		}
 	} else {
-		page = dev_alloc_page();
+		page = page_pool_dev_alloc_pages(rxq->page_pool);
 		if (!page)
 			return NULL;
 
+		*from_pool = true;
 		va = page_to_virt(page);
 	}
 
 	*da = dma_map_single(dev, va + rxq->headroom, rxq->datasize,
 			     DMA_FROM_DEVICE);
 	if (dma_mapping_error(dev, *da)) {
-		put_page(virt_to_head_page(va));
+		if (*from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, is_napi);
+		else
+			put_page(virt_to_head_page(va));
+
 		return NULL;
 	}
 
@@ -1552,21 +1567,25 @@  static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
 
 /* Allocate frag for rx buffer, and save the old buf */
 static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
-			       struct mana_recv_buf_oob *rxoob, void **old_buf)
+			       struct mana_recv_buf_oob *rxoob, void **old_buf,
+			       bool *old_fp)
 {
+	bool from_pool;
 	dma_addr_t da;
 	void *va;
 
-	va = mana_get_rxfrag(rxq, dev, &da, true);
+	va = mana_get_rxfrag(rxq, dev, &da, &from_pool, true);
 	if (!va)
 		return;
 
 	dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
 			 DMA_FROM_DEVICE);
 	*old_buf = rxoob->buf_va;
+	*old_fp = rxoob->from_pool;
 
 	rxoob->buf_va = va;
 	rxoob->sgl[0].address = da;
+	rxoob->from_pool = from_pool;
 }
 
 static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -1580,6 +1599,7 @@  static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	struct device *dev = gc->dev;
 	void *old_buf = NULL;
 	u32 curr, pktlen;
+	bool old_fp;
 
 	apc = netdev_priv(ndev);
 
@@ -1622,12 +1642,12 @@  static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
 	rxbuf_oob = &rxq->rx_oobs[curr];
 	WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
 
-	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf);
+	mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
 
 	/* Unsuccessful refill will have old_buf == NULL.
 	 * In this case, mana_rx_skb() will drop the packet.
 	 */
-	mana_rx_skb(old_buf, oob, rxq);
+	mana_rx_skb(old_buf, old_fp, oob, rxq);
 
 drop:
 	mana_move_wq_tail(rxq->gdma_rq, rxbuf_oob->wqe_inf.wqe_size_in_bu);
@@ -1659,6 +1679,8 @@  static void mana_poll_rx_cq(struct mana_cq *cq)
 
 	if (rxq->xdp_flush)
 		xdp_do_flush();
+
+	page_pool_nid_changed(rxq->page_pool, numa_mem_id());
 }
 
 static int mana_cq_handler(void *context, struct gdma_queue *gdma_queue)
@@ -1881,6 +1903,7 @@  static void mana_destroy_rxq(struct mana_port_context *apc,
 	struct mana_recv_buf_oob *rx_oob;
 	struct device *dev = gc->dev;
 	struct napi_struct *napi;
+	struct page *page;
 	int i;
 
 	if (!rxq)
@@ -1913,10 +1936,18 @@  static void mana_destroy_rxq(struct mana_port_context *apc,
 		dma_unmap_single(dev, rx_oob->sgl[0].address,
 				 rx_oob->sgl[0].size, DMA_FROM_DEVICE);
 
-		put_page(virt_to_head_page(rx_oob->buf_va));
+		page = virt_to_head_page(rx_oob->buf_va);
+
+		if (rx_oob->from_pool)
+			page_pool_put_full_page(rxq->page_pool, page, false);
+		else
+			put_page(page);
+
 		rx_oob->buf_va = NULL;
 	}
 
+	page_pool_destroy(rxq->page_pool);
+
 	if (rxq->gdma_rq)
 		mana_gd_destroy_queue(gc, rxq->gdma_rq);
 
@@ -1927,18 +1958,20 @@  static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
 			    struct mana_rxq *rxq, struct device *dev)
 {
 	struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+	bool from_pool = false;
 	dma_addr_t da;
 	void *va;
 
 	if (mpc->rxbufs_pre)
 		va = mana_get_rxbuf_pre(rxq, &da);
 	else
-		va = mana_get_rxfrag(rxq, dev, &da, false);
+		va = mana_get_rxfrag(rxq, dev, &da, &from_pool, false);
 
 	if (!va)
 		return -ENOMEM;
 
 	rx_oob->buf_va = va;
+	rx_oob->from_pool = from_pool;
 
 	rx_oob->sgl[0].address = da;
 	rx_oob->sgl[0].size = rxq->datasize;
@@ -2008,6 +2041,25 @@  static int mana_push_wqe(struct mana_rxq *rxq)
 	return 0;
 }
 
+static int mana_create_page_pool(struct mana_rxq *rxq)
+{
+	struct page_pool_params pprm = {};
+	int ret;
+
+	pprm.pool_size = RX_BUFFERS_PER_QUEUE;
+	pprm.napi = &rxq->rx_cq.napi;
+
+	rxq->page_pool = page_pool_create(&pprm);
+
+	if (IS_ERR(rxq->page_pool)) {
+		ret = PTR_ERR(rxq->page_pool);
+		rxq->page_pool = NULL;
+		return ret;
+	}
+
+	return 0;
+}
+
 static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 					u32 rxq_idx, struct mana_eq *eq,
 					struct net_device *ndev)
@@ -2037,6 +2089,13 @@  static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 	mana_get_rxbuf_cfg(ndev->mtu, &rxq->datasize, &rxq->alloc_size,
 			   &rxq->headroom);
 
+	/* Create page pool for RX queue */
+	err = mana_create_page_pool(rxq);
+	if (err) {
+		netdev_err(ndev, "Create page pool err:%d\n", err);
+		goto out;
+	}
+
 	err = mana_alloc_rx_wqe(apc, rxq, &rq_size, &cq_size);
 	if (err)
 		goto out;
@@ -2108,8 +2167,8 @@  static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
 
 	WARN_ON(xdp_rxq_info_reg(&rxq->xdp_rxq, ndev, rxq_idx,
 				 cq->napi.napi_id));
-	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq,
-					   MEM_TYPE_PAGE_SHARED, NULL));
+	WARN_ON(xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
+					   rxq->page_pool));
 
 	napi_enable(&cq->napi);
 
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 024ad8ddb27e..b12859511839 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -280,6 +280,7 @@  struct mana_recv_buf_oob {
 	struct gdma_wqe_request wqe_req;
 
 	void *buf_va;
+	bool from_pool; /* allocated from a page pool */
 
 	/* SGL of the buffer going to be sent has part of the work request. */
 	u32 num_sge;
@@ -330,6 +331,8 @@  struct mana_rxq {
 	bool xdp_flush;
 	int xdp_rc; /* XDP redirect return code */
 
+	struct page_pool *page_pool;
+
 	/* MUST BE THE LAST MEMBER:
 	 * Each receive buffer has an associated mana_recv_buf_oob.
 	 */