Message ID | 668cfa117e41a0f1325593c94f6bb739c3bb38da.1736777576.git.0x1207@gmail.com (mailing list archive)
---|---
State | New
Series | net: stmmac: RX performance improvement
On Mon, Jan 13, 2025 at 10:20:31PM +0800, Furong Xu wrote:
> Current code prefetches cache lines for the received frame first, and
> then dma_sync_single_for_cpu() against this frame, this is wrong.
> Cache prefetch should be triggered after dma_sync_single_for_cpu().
>
> This patch brings ~2.8% driver performance improvement in a TCP RX
> throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> core, 2.84 Gbits/sec increased to 2.92 Gbits/sec.
>
> Signed-off-by: Furong Xu <0x1207@gmail.com>
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ca340fd8c937..b60f2f27140c 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>
>  		/* Buffer is good. Go on. */
>
> -		prefetch(page_address(buf->page) + buf->page_offset);
> -		if (buf->sec_page)
> -			prefetch(page_address(buf->sec_page));
> -
>  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
>  		len += buf1_len;
>  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
>
>  			dma_sync_single_for_cpu(priv->device, buf->addr,
>  						buf1_len, dma_dir);
> +			prefetch(page_address(buf->page) + buf->page_offset);

Minor nit: I've seen in other drivers authors using net_prefetch.
Probably not worth a re-roll just for something this minor.
On Tue, 14 Jan 2025 15:31:05 -0800 Joe Damato wrote:
> > @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >
> >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> >  						buf1_len, dma_dir);
> > +			prefetch(page_address(buf->page) + buf->page_offset);
>
> Minor nit: I've seen in other drivers authors using net_prefetch.
> Probably not worth a re-roll just for something this minor.

Let's respin. I don't know how likely stmmac is to be integrated into an
SoC with 64B cachelines these days, but since you caught this - why not
potentially save someone from investigating this later..
On Tue, 14 Jan 2025 15:31:05 -0800, Joe Damato <jdamato@fastly.com> wrote:
> On Mon, Jan 13, 2025 at 10:20:31PM +0800, Furong Xu wrote:
> > Current code prefetches cache lines for the received frame first, and
> > then dma_sync_single_for_cpu() against this frame, this is wrong.
> > Cache prefetch should be triggered after dma_sync_single_for_cpu().
> >
> > This patch brings ~2.8% driver performance improvement in a TCP RX
> > throughput test with iPerf tool on a single isolated Cortex-A65 CPU
> > core, 2.84 Gbits/sec increased to 2.92 Gbits/sec.
> >
> > Signed-off-by: Furong Xu <0x1207@gmail.com>
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index ca340fd8c937..b60f2f27140c 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >
> >  		/* Buffer is good. Go on. */
> >
> > -		prefetch(page_address(buf->page) + buf->page_offset);
> > -		if (buf->sec_page)
> > -			prefetch(page_address(buf->sec_page));
> > -
> >  		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
> >  		len += buf1_len;
> >  		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
> > @@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
> >
> >  			dma_sync_single_for_cpu(priv->device, buf->addr,
> >  						buf1_len, dma_dir);
> > +			prefetch(page_address(buf->page) + buf->page_offset);
>
> Minor nit: I've seen in other drivers authors using net_prefetch.
> Probably not worth a re-roll just for something this minor.

After switching to net_prefetch(), I get another 4.5% throughput
improvement :) Thanks!

This is definitely worth a v3 of this series.

pw-bot: changes-requested
```diff
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ca340fd8c937..b60f2f27140c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -5500,10 +5500,6 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 		/* Buffer is good. Go on. */
 
-		prefetch(page_address(buf->page) + buf->page_offset);
-		if (buf->sec_page)
-			prefetch(page_address(buf->sec_page));
-
 		buf1_len = stmmac_rx_buf1_len(priv, p, status, len);
 		len += buf1_len;
 		buf2_len = stmmac_rx_buf2_len(priv, p, status, len);
@@ -5525,6 +5521,7 @@ static int stmmac_rx(struct stmmac_priv *priv, int limit, u32 queue)
 
 			dma_sync_single_for_cpu(priv->device, buf->addr,
 						buf1_len, dma_dir);
+			prefetch(page_address(buf->page) + buf->page_offset);
 
 			xdp_init_buff(&ctx.xdp, buf_sz, &rx_q->xdp_rxq);
 			xdp_prepare_buff(&ctx.xdp, page_address(buf->page),
```
Current code prefetches cache lines for the received frame first, and
then calls dma_sync_single_for_cpu() against this frame, which is wrong.
Cache prefetch should be triggered after dma_sync_single_for_cpu().

This patch brings a ~2.8% driver performance improvement in a TCP RX
throughput test with the iPerf tool on a single isolated Cortex-A65 CPU
core: 2.84 Gbits/sec increased to 2.92 Gbits/sec.

Signed-off-by: Furong Xu <0x1207@gmail.com>
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)
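The reasoning behind the commit message is worth spelling out: for a
DMA_FROM_DEVICE buffer on a non-coherent platform, dma_sync_single_for_cpu()
typically invalidates the CPU cache lines covering the frame, so any lines
prefetched before the sync are discarded (and the prefetch may even pull in
stale data). A pseudocode sketch of the two orderings, using the names from
the patch; this is not a standalone program:

```c
/* Wrong: the sync below invalidates the lines that were just
 * prefetched, so the prefetch is wasted work at best. */
prefetch(page_address(buf->page) + buf->page_offset);
dma_sync_single_for_cpu(priv->device, buf->addr, buf1_len, dma_dir);

/* Right: hand the buffer to the CPU first, then warm the cache with
 * the frame headers the stack is about to parse. */
dma_sync_single_for_cpu(priv->device, buf->addr, buf1_len, dma_dir);
net_prefetch(page_address(buf->page) + buf->page_offset);
```

Per the thread above, the v3 of the series uses net_prefetch() rather than
the bare prefetch() shown in this v2 diff.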