
[net-next,3/3] bnxt_en: Let the page pool manage the DMA mapping

Message ID: 20230728231829.235716-4-michael.chan@broadcom.com (mailing list archive)
State: Changes Requested
Delegated to: Netdev Maintainers
Series: bnxt_en: Add support for page pool

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1328 this patch: 1328
netdev/cc_maintainers success CCed 6 of 6 maintainers
netdev/build_clang success Errors and warnings before: 1351 this patch: 1351
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1351 this patch: 1351
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 80 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Michael Chan July 28, 2023, 11:18 p.m. UTC
From: Somnath Kotur <somnath.kotur@broadcom.com>

Use the page pool's ability to maintain DMA mappings for us.
This avoids re-mapping of the recycled pages.

Signed-off-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 32 +++++++----------------
 1 file changed, 10 insertions(+), 22 deletions(-)
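
At a glance, the core of the change, condensed from the hunks at the bottom
of this page (sketch only, not the full patch):

	/* Pool setup: the pool now maps each page once and keeps the
	 * mapping across recycles (PP_FLAG_DMA_MAP), and syncs for the
	 * device when a page re-enters the pool (PP_FLAG_DMA_SYNC_DEV).
	 */
	pp.dma_dir = bp->rx_dir;
	pp.max_len = BNXT_RX_PAGE_SIZE;
	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;

	/* Allocation path: no dma_map_page_attrs()/dma_mapping_error();
	 * the driver just reads back the address the pool already holds.
	 */
	*mapping = page_pool_get_dma_addr(page) + *offset;

	/* Receive path: the per-buffer unmap becomes a sync-for-cpu. */
	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
				bp->rx_dir);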

Comments

Jakub Kicinski July 29, 2023, 12:42 a.m. UTC | #1
On Fri, 28 Jul 2023 16:18:29 -0700 Michael Chan wrote:
> +	pp.dma_dir = bp->rx_dir;
> +	pp.max_len = BNXT_RX_PAGE_SIZE;

I _think_ you need PAGE_SIZE here.

This should be smaller than PAGE_SIZE only if you're wasting the rest
of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.

Adding Jesper to CC to keep me honest.

> +	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
Jesper Dangaard Brouer July 31, 2023, 5:47 p.m. UTC | #2
On 29/07/2023 02.42, Jakub Kicinski wrote:
> On Fri, 28 Jul 2023 16:18:29 -0700 Michael Chan wrote:
>> +	pp.dma_dir = bp->rx_dir;
>> +	pp.max_len = BNXT_RX_PAGE_SIZE;
> 
> I _think_ you need PAGE_SIZE here.
> 

I actually think pp.max_len = BNXT_RX_PAGE_SIZE is correct here.
(Although it can be optimized, see below)

> This should be smaller than PAGE_SIZE only if you're wasting the rest
> of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
> 

Remember pp.max_len is used for dma_sync_for_device.
If the driver is smart, it can set pp.max_len according to the MTU, as it
knows the hardware will not go beyond this (so the DMA sync for device can
be limited accordingly).
On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
optimized for this. I remember it had HUGE effects on the ARM EspressoBin
board.


> Adding Jesper to CC to keep me honest.

Adding Ilias to keep me honest ;-)

To follow/understand these changes, reviewers need to keep the context
of patch 1/3 in mind [1].

[1] 
https://lore.kernel.org/all/20230728231829.235716-2-michael.chan@broadcom.com/


> 
>> +	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
> 

--Jesper
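
To illustrate the optimization described above (for a hypothetical
page-per-packet driver, not the frag API bnxt uses; the rest of the thread
concludes it does not apply here), the setup would look roughly like:

	/* Illustrative only: one packet per page, so the device can never
	 * write past the packet buffer and sync-for-device can stop at
	 * max_len instead of covering the whole page.  "mtu" is a stand-in
	 * for whatever buffer length the driver programs into the NIC.
	 */
	pp.flags   = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
	pp.offset  = XDP_PACKET_HEADROOM;	/* headroom HW never touches */
	pp.max_len = ETH_HLEN + mtu;		/* HW writes at most this much */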
Jakub Kicinski July 31, 2023, 6 p.m. UTC | #3
On Mon, 31 Jul 2023 19:47:08 +0200 Jesper Dangaard Brouer wrote:
> > This should be smaller than PAGE_SIZE only if you're wasting the rest
> > of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> > PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
> 
> Remember pp.max_len is used for dma_sync_for_device.
> If the driver is smart, it can set pp.max_len according to the MTU, as it
> knows the hardware will not go beyond this (so the DMA sync for device can
> be limited accordingly).
> On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> optimized for this. I remember it had HUGE effects on the ARM EspressoBin
> board.

Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
they will be filled with TCP payload start to end. page_pool_put_page()
does nothing for non-last frag, so we'll only sync for the last
(BNXT_RX_PAGE-sized) frag released, and we need to sync the entire 
host page.
Michael Chan July 31, 2023, 6:16 p.m. UTC | #4
On Mon, Jul 31, 2023 at 11:00 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 31 Jul 2023 19:47:08 +0200 Jesper Dangaard Brouer wrote:
> > > This should be smaller than PAGE_SIZE only if you're wasting the rest
> > > of the buffer, e.g. MTU is 3k so you know last 1k will never get used.
> > > PAGE_SIZE is always a multiple of BNXT_RX_PAGE so you waste nothing.
> >
> > Remember pp.max_len is used for dma_sync_for_device.
> > If the driver is smart, it can set pp.max_len according to the MTU, as it
> > knows the hardware will not go beyond this (so the DMA sync for device can
> > be limited accordingly).
> > On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> > optimized for this. I remember it had HUGE effects on the ARM EspressoBin
> > board.
>
> Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
> they will be filled with TCP payload start to end. page_pool_put_page()
> does nothing for non-last frag, so we'll only sync for the last
> (BNXT_RX_PAGE-sized) frag released, and we need to sync the entire
> host page.

Correct, there is no MTU here.  Remember this matters only when
PAGE_SIZE > BNXT_RX_PAGE_SIZE (e.g. 64K PAGE_SIZE and 32K
BNXT_RX_PAGE_SIZE).  I think we want to dma_sync_for_device for 32K in
this case.
Jakub Kicinski July 31, 2023, 6:44 p.m. UTC | #5
On Mon, 31 Jul 2023 11:16:55 -0700 Michael Chan wrote:
> > > Remember pp.max_len is used for dma_sync_for_device.
> > > If the driver is smart, it can set pp.max_len according to the MTU, as it
> > > knows the hardware will not go beyond this (so the DMA sync for device can
> > > be limited accordingly).
> > > On Intel "dma_sync_for_device" is a no-op, so most drivers haven't
> > > optimized for this. I remember it had HUGE effects on the ARM EspressoBin
> > > board.
> >
> > Note that (AFAIU) there is no MTU here, these are pages for LRO/GRO,
> > they will be filled with TCP payload start to end. page_pool_put_page()
> > does nothing for non-last frag, so we'll only sync for the last
> > (BNXT_RX_PAGE-sized) frag released, and we need to sync the entire
> > host page.  
> 
> Correct, there is no MTU here.  Remember this matters only when
> PAGE_SIZE > BNXT_RX_PAGE_SIZE (e.g. 64K PAGE_SIZE and 32K
> BNXT_RX_PAGE_SIZE).  I think we want to dma_sync_for_device for 32K in
> this case.

Maybe I'm misunderstanding. Let me tell you how I think this works and
perhaps we should update the docs based on this discussion.

Note that the max_len is applied to the full host page when the full
host page is returned. Not to fragments, and not at allocation.

The .max_len is the max offset within the host page that the HW may
access. For page-per-packet, 1500B MTU this could matter quite a bit,
because we only have to sync ~1500B rather than 4096B.

      some wasted headroom/padding, pp.offset can be used to skip
    /        device may touch this section
   /        /                     device will not touch, sync not needed
  /        /                     /
|**| ===== MTU 1500B ====== | - skb_shinfo and unused --- |
   <------ .max_len -------->

For fragmented pages it becomes:

                         middle skb_shinfo
                        /                         remainder
                       /                               |
|**| == MTU == | - shinfo- |**| == MTU == | - shinfo- |+++|
   <------------ .max_len ---------------->

So max_len will only exclude the _last_ shinfo and the wasted space
(remainder of dividing the page by the buffer size). We must sync _all_
packet sections ("== MTU ==") within the page.

In bnxt's case - the page is fragmented (latter diagram), and there is
no start offset or wasted space. Ergo .max_len = PAGE_SIZE.

Where did I get off the track?
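
For concreteness, a worked instance of bnxt's fragmented case using the
numbers from this thread (PAGE_SIZE = 64K, BNXT_RX_PAGE_SIZE = 32K): one
host page carries two 32K fragments, and each fragment may belong to a
different packet. With .offset = 0 and .max_len = BNXT_RX_PAGE_SIZE the
pool would sync only bytes 0..32K when the page re-enters the pool,
leaving the second fragment unsynced; covering both fragments requires
.max_len = 2 * 32K = 64K = PAGE_SIZE.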
Michael Chan July 31, 2023, 8:20 p.m. UTC | #6
On Mon, Jul 31, 2023 at 11:44 AM Jakub Kicinski <kuba@kernel.org> wrote:
> Maybe I'm misunderstanding. Let me tell you how I think this works and
> perhaps we should update the docs based on this discussion.
>
> Note that the max_len is applied to the full host page when the full
> host page is returned. Not to fragments, and not at allocation.
>

I think I am beginning to understand what the confusion is.  These 32K
page fragments within the page may not belong to the same (GRO)
packet.  So we cannot dma_sync the whole page at the same time.
Without setting PP_FLAG_DMA_SYNC_DEV, the driver code should be
something like this:

mapping = page_pool_get_dma_addr(page) + offset;
dma_sync_single_for_device(dev, mapping, BNXT_RX_PAGE_SIZE, bp->rx_dir);

offset may be 0, 32K, etc.

Since the PP_FLAG_DMA_SYNC_DEV logic is not aware of this offset, we
actually must do our own dma_sync and not use PP_FLAG_DMA_SYNC_DEV in
this case.  Does that sound right?
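
Filling out the sketch above (assumed shape only; the frag allocation comes
from patch 1/3 and the surrounding locals are illustrative), the manual
variant without PP_FLAG_DMA_SYNC_DEV would be roughly:

	page = page_pool_dev_alloc_frag(rxr->page_pool, offset,
					BNXT_RX_PAGE_SIZE);
	if (!page)
		return NULL;

	/* The pool still owns the mapping (PP_FLAG_DMA_MAP); the driver
	 * only syncs the 32K fragment it is about to give to the NIC.
	 */
	*mapping = page_pool_get_dma_addr(page) + *offset;
	dma_sync_single_for_device(&bp->pdev->dev, *mapping,
				   BNXT_RX_PAGE_SIZE, bp->rx_dir);
	return page;

As the follow-ups below conclude, this turns out to be unnecessary.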
Jakub Kicinski July 31, 2023, 8:44 p.m. UTC | #7
On Mon, 31 Jul 2023 13:20:04 -0700 Michael Chan wrote:
> I think I am beginning to understand what the confusion is.  These 32K
> page fragments within the page may not belong to the same (GRO)
> packet.

Right.

> So we cannot dma_sync the whole page at the same time.

I wouldn't phrase it like that.

> Without setting PP_FLAG_DMA_SYNC_DEV, the driver code should be
> something like this:
> 
> mapping = page_pool_get_dma_addr(page) + offset;
> dma_sync_single_for_device(dev, mapping, BNXT_RX_PAGE_SIZE, bp->rx_dir);
> 
> offset may be 0, 32K, etc.
> 
> Since the PP_FLAG_DMA_SYNC_DEV logic is not aware of this offset, we
> actually must do our own dma_sync and not use PP_FLAG_DMA_SYNC_DEV in
> this case.  Does that sound right?

No, no, all I'm saying is that with the current code (in page pool)
you can't be very intelligent about the sync'ing. Every time a page
enters the pool - the whole page should be synced. But that's fine,
it's still better to let page pool do the syncing than trying to
do it manually in the driver (since freshly allocated pages do not 
have to be synced).

I think the confusion comes partially from the fact that the driver
only ever deals with fragments (32k), but internally page pool does
recycling in full pages (64k). And .max_len is part of the recycling
machinery, so to speak, not part of the allocation machinery.

tl;dr just set .max_len = PAGE_SIZE and all will be right.
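
Applied to the init hunk at the bottom of this patch, the suggestion would
read roughly as follows (sketch of the requested change, not a tested
respin):

	pp.dma_dir = bp->rx_dir;
	pp.max_len = PAGE_SIZE;	/* whole host page is synced on recycle */
	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE)
		pp.flags |= PP_FLAG_PAGE_FRAG;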
Michael Chan July 31, 2023, 9:11 p.m. UTC | #8
On Mon, Jul 31, 2023 at 1:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
> tl;dr just set .max_len = PAGE_SIZE and all will be right.

OK I think I got it now.  The page is only recycled when all the
fragments are recycled and so we can let page pool DMA sync the whole
page at that time.
Jesper Dangaard Brouer Aug. 1, 2023, 5:06 p.m. UTC | #9
On 31/07/2023 23.11, Michael Chan wrote:
> On Mon, Jul 31, 2023 at 1:44 PM Jakub Kicinski <kuba@kernel.org> wrote:
>> tl;dr just set .max_len = PAGE_SIZE and all will be right.
> 
> OK I think I got it now.  The page is only recycled when all the
> fragments are recycled and so we can let page pool DMA sync the whole
> page at that time.

Yes, Jakub is right, I see that now.

When using the page_pool "frag" API (e.g. page_pool_dev_alloc_frag), the
optimization I talked about isn't valid.  We simply have to DMA sync the
entire page when it gets back to the recycle stage.

--Jesper

Patch

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index adf785b7aa42..b35bc92094ce 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -759,7 +759,6 @@  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 					 unsigned int *offset,
 					 gfp_t gfp)
 {
-	struct device *dev = &bp->pdev->dev;
 	struct page *page;
 
 	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) {
@@ -772,12 +771,7 @@  static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
 	if (!page)
 		return NULL;
 
-	*mapping = dma_map_page_attrs(dev, page, *offset, BNXT_RX_PAGE_SIZE,
-				      bp->rx_dir, DMA_ATTR_WEAK_ORDERING);
-	if (dma_mapping_error(dev, *mapping)) {
-		page_pool_recycle_direct(rxr->page_pool, page);
-		return NULL;
-	}
+	*mapping = page_pool_get_dma_addr(page) + *offset;
 	return page;
 }
 
@@ -996,8 +990,8 @@  static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+				bp->rx_dir);
 	skb = build_skb(data_ptr - bp->rx_offset, BNXT_RX_PAGE_SIZE);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
@@ -1030,8 +1024,8 @@  static struct sk_buff *bnxt_rx_page_skb(struct bnxt *bp,
 		return NULL;
 	}
 	dma_addr -= bp->rx_dma_offset;
-	dma_unmap_page_attrs(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE, bp->rx_dir,
-			     DMA_ATTR_WEAK_ORDERING);
+	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, BNXT_RX_PAGE_SIZE,
+				bp->rx_dir);
 
 	if (unlikely(!payload))
 		payload = eth_get_headlen(bp->dev, data_ptr, len);
@@ -1147,9 +1141,8 @@  static u32 __bnxt_rx_agg_pages(struct bnxt *bp,
 			return 0;
 		}
 
-		dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-				     bp->rx_dir,
-				     DMA_ATTR_WEAK_ORDERING);
+		dma_sync_single_for_cpu(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
+					bp->rx_dir);
 
 		total_frag_len += frag_len;
 		prod = NEXT_RX_AGG(prod);
@@ -2945,10 +2938,6 @@  static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 
 		rx_buf->data = NULL;
 		if (BNXT_RX_PAGE_MODE(bp)) {
-			mapping -= bp->rx_dma_offset;
-			dma_unmap_page_attrs(&pdev->dev, mapping, BNXT_RX_PAGE_SIZE,
-					     bp->rx_dir,
-					     DMA_ATTR_WEAK_ORDERING);
 			page_pool_recycle_direct(rxr->page_pool, data);
 		} else {
 			dma_unmap_single_attrs(&pdev->dev, mapping,
@@ -2969,9 +2958,6 @@  static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr)
 		if (!page)
 			continue;
 
-		dma_unmap_page_attrs(&pdev->dev, rx_agg_buf->mapping,
-				     BNXT_RX_PAGE_SIZE, bp->rx_dir,
-				     DMA_ATTR_WEAK_ORDERING);
 		rx_agg_buf->page = NULL;
 		__clear_bit(i, rxr->rx_agg_bmap);
 
@@ -3203,7 +3189,9 @@  static int bnxt_alloc_rx_page_pool(struct bnxt *bp,
 	pp.nid = dev_to_node(&bp->pdev->dev);
 	pp.napi = &rxr->bnapi->napi;
 	pp.dev = &bp->pdev->dev;
-	pp.dma_dir = DMA_BIDIRECTIONAL;
+	pp.dma_dir = bp->rx_dir;
+	pp.max_len = BNXT_RX_PAGE_SIZE;
+	pp.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV;
 	if (PAGE_SIZE > BNXT_RX_PAGE_SIZE)
 		pp.flags |= PP_FLAG_PAGE_FRAG;