| Message ID | d3ae6bd3537fbce379382ac6a42f67e22f27ece2.1683896626.git.lorenzo@kernel.org (mailing list archive) |
|---|---|
| State | RFC |
| Delegated to: | Netdev Maintainers |
| Series | [RFC,net-next] net: veth: reduce page_pool memory footprint using half page per-buffer |
Context | Check | Description |
---|---|---|
netdev/series_format | success | Single patches do not need cover letters |
netdev/tree_selection | success | Clearly marked for net-next |
netdev/fixes_present | success | Fixes tag not required for -next series |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 8 this patch: 8 |
netdev/cc_maintainers | success | CCed 6 of 6 maintainers |
netdev/build_clang | success | Errors and warnings before: 8 this patch: 8 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/deprecated_api | success | None detected |
netdev/check_selftest | success | No net selftest shell script |
netdev/verify_fixes | success | No Fixes tag |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 8 this patch: 8 |
netdev/checkpatch | warning | WARNING: line length of 82 exceeds 80 columns |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Fri, 12 May 2023 15:08:13 +0200

> In order to reduce page_pool memory footprint, rely on
> page_pool_dev_alloc_frag routine and reduce buffer size
> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> supported MTU is now reduced to 36350B.

I thought we were stepping away from page splitting bit by bit O_o

Primarily for the reasons you mentioned / worked around here: it creates
several significant limitations, and at least on 64-bit systems it
doesn't scale anymore. 192 bytes of headroom is less than what XDP
expects (isn't it? Isn't 256 standard-standard, so that the skb XDP path
reallocates heads only to have 256+ there?), 384 bytes of shinfo can
change anytime, and even now page split simply blocks you from
increasing MAX_SKB_FRAGS even by one. Not speaking of MTU limitations
etc.
BTW, Intel drivers suffer from the very same things due solely to page
split (and I'm almost done with converting at least some of them to Page
Pool and the 1 page per buffer model); I don't recommend deliberately
falling into that pit =\ :D

>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/veth.c | 39 +++++++++++++++++++++++++--------------
>  1 file changed, 25 insertions(+), 14 deletions(-)
[...]

Thanks,
Olek
> From: Lorenzo Bianconi <lorenzo@kernel.org>
> Date: Fri, 12 May 2023 15:08:13 +0200
>
> > In order to reduce page_pool memory footprint, rely on
> > page_pool_dev_alloc_frag routine and reduce buffer size
> > (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> > for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> > (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> > Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> > supported MTU is now reduced to 36350B.
>
> I thought we're stepping away from page splitting bit by bit O_o

do you mean driver-private page_split implementations? AFAIK we are not
stepping away from the page_pool page split implementation (or maybe I
missed it :))

> Primarily for the reasons you mentioned / worked around here: it creates
> several significant limitations and at least on 64-bit systems it
> doesn't scale anymore. 192 bytes of headroom is less than what XDP
> expects (isn't it? Isn't 256 standard-standard, so that skb XDP path
> reallocates heads only to have 256+ there?), 384 bytes of shinfo can
> change anytime and even now page split simply blocks you from increasing
> MAX_SKB_FRAGS even by one. Not speaking of MTU limitations etc.
> BTW Intel drivers suffer from the very same things due solely to page
> split (and I'm almost done with converting at least some of them to Page
> Pool and 1 page per buffer model), I don't recommend deliberately
> falling into that pit =\ :D

I am not sure about the 192 vs 256 bytes of headroom (this is why I sent
this patch as an RFC; my main goal is to discuss this requirement). In
the previous discussion [0] we deferred this implementation since, if we
do not reduce the requested xdp headroom, we will not be able to fit two
1500B frames into a single page (due to the skb_shared_info size [1])
and we introduce a performance penalty.

Regards,
Lorenzo

[0] https://lore.kernel.org/netdev/6298f73f7cc7391c7c4a52a6a89b1ae21488bda1.1682188837.git.lorenzo@kernel.org/
[1] $ pahole -C skb_shared_info vmlinux.o
struct skb_shared_info {
	__u8 flags;                                   /*  0   1 */
	__u8 meta_len;                                /*  1   1 */
	__u8 nr_frags;                                /*  2   1 */
	__u8 tx_flags;                                /*  3   1 */
	unsigned short gso_size;                      /*  4   2 */
	unsigned short gso_segs;                      /*  6   2 */
	struct sk_buff * frag_list;                   /*  8   8 */
	struct skb_shared_hwtstamps hwtstamps;        /* 16   8 */
	unsigned int gso_type;                        /* 24   4 */
	u32 tskey;                                    /* 28   4 */
	atomic_t dataref;                             /* 32   4 */
	unsigned int xdp_frags_size;                  /* 36   4 */
	void * destructor_arg;                        /* 40   8 */
	skb_frag_t frags[17];                         /* 48 272 */

	/* size: 320, cachelines: 5, members: 14 */
};

> >
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  drivers/net/veth.c | 39 +++++++++++++++++++++++++--------------
> >  1 file changed, 25 insertions(+), 14 deletions(-)
> [...]
>
> Thanks,
> Olek
On 2023/5/12 21:08, Lorenzo Bianconi wrote:
> In order to reduce page_pool memory footprint, rely on
> page_pool_dev_alloc_frag routine and reduce buffer size
> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page

Is there any performance improvement besides the memory saving? As it
should reduce TLB misses, I wonder if the TLB miss reduction can even
out the cost of the extra frag reference count handling for the frag
support?

> for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> supported MTU is now reduced to 36350B.

Maybe we don't need to limit the frag size to VETH_PAGE_POOL_FRAG_SIZE,
and could use a different frag size depending on the MTU or packet size?

Perhaps page_pool_dev_alloc_frag() can be improved to return a non-frag
page if the requested frag size is larger than a specified size, too.
I will try to implement it if the above idea makes sense.
> On 2023/5/12 21:08, Lorenzo Bianconi wrote:
> > In order to reduce page_pool memory footprint, rely on
> > page_pool_dev_alloc_frag routine and reduce buffer size
> > (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
>
> Is there any performance improvement beside the memory saving? As it
> should reduce TLB miss, I wonder if the TLB miss reducing can even
> out the cost of the extra frag reference count handling for the
> frag support?

reducing the requested headroom to 192 (from 256), we get a nice
improvement in the 1500B frame case, while it is mostly the same in the
case of paged skbs (e.g. MTU 8000B).

> > for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> > (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> > Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> > supported MTU is now reduced to 36350B.
>
> Maybe we don't need to limit the frag size to VETH_PAGE_POOL_FRAG_SIZE,
> and use different frag size depending on the mtu or packet size?
>
> Perhaps the page_pool_dev_alloc_frag() can be improved to return non-frag
> page if the requested frag size is larger than a specified size too.
> I will try to implement it if the above idea makes sense.

since there are no significant differences between the full page and
fragmented page implementations if the MTU is over the page boundary, is
it worth doing so? (at least for the veth use-case)

Regards,
Lorenzo
On Mon, May 15, 2023 at 01:24:20PM +0200, Lorenzo Bianconi wrote:
> > On 2023/5/12 21:08, Lorenzo Bianconi wrote:
> > > In order to reduce page_pool memory footprint, rely on
> > > page_pool_dev_alloc_frag routine and reduce buffer size
> > > (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> >
> > Is there any performance improvement beside the memory saving? As it
> > should reduce TLB miss, I wonder if the TLB miss reducing can even
> > out the cost of the extra frag reference count handling for the
> > frag support?
>
> reducing the requested headroom to 192 (from 256) we have a nice improvement in
> the 1500B frame case while it is mostly the same in the case of paged skb
> (e.g. MTU 8000B).

Can you define 'nice improvement'? ;)
Show us numbers or improvement in %.

> > > for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> > > (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> > > Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> > > supported MTU is now reduced to 36350B.
> >
> > Maybe we don't need to limit the frag size to VETH_PAGE_POOL_FRAG_SIZE,
> > and use different frag size depending on the mtu or packet size?
> >
> > Perhaps the page_pool_dev_alloc_frag() can be improved to return non-frag
> > page if the requested frag size is larger than a specified size too.
> > I will try to implement it if the above idea makes sense.
>
> since there are no significant differences between full page and fragmented page
> implementation if the MTU is over the page boundary, does it worth to do so?
> (at least for the veth use-case).
>
> Regards,
> Lorenzo
From: Lorenzo Bianconi <lorenzo@kernel.org>
Date: Fri, 12 May 2023 16:14:48 +0200

>> From: Lorenzo Bianconi <lorenzo@kernel.org>
>> Date: Fri, 12 May 2023 15:08:13 +0200
>>
>>> In order to reduce page_pool memory footprint, rely on
>>> page_pool_dev_alloc_frag routine and reduce buffer size
>>> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
>>> for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
>>> (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
>>> Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
>>> supported MTU is now reduced to 36350B.
>>
>> I thought we're stepping away from page splitting bit by bit O_o
>
> do you mean to driver private page_split implementation? AFAIK we are not
> stepping away from page_pool page split implementation (or maybe I missed it :))

Page split in general. Since early-mid 2021, Page Pool with 1 page per
frame has shown results comparable to drivers doing page split, but it
doesn't have such MTU / headroom / ... limitations.

>
>> Primarily for the reasons you mentioned / worked around here: it creates
>> several significant limitations and at least on 64-bit systems it
>> doesn't scale anymore. 192 bytes of headroom is less than what XDP
>> expects (isn't it? Isn't 256 standard-standard, so that skb XDP path
>> reallocates heads only to have 256+ there?), 384 bytes of shinfo can
>> change anytime and even now page split simply blocks you from increasing
>> MAX_SKB_FRAGS even by one. Not speaking of MTU limitations etc.
>> BTW Intel drivers suffer from the very same things due solely to page
>> split (and I'm almost done with converting at least some of them to Page
>> Pool and 1 page per buffer model), I don't recommend deliberately
>> falling into that pit =\ :D
>
> I am not sure about the 192 vs 256 bytes of headroom (this is why I sent this
> patch as RFC, my main goal is to discuss about this requirement). In the
> previous discussion [0] we deferred this implementation since if we do not
> reduce requested xdp headroom, we will not be able to fit two 1500B frames
> into a single page (for skb_shared_info size [1]) and we introduce a performance
> penalty.

Yes, that's what I'm talking about. In order to fit two 1500-byte frames
onto one page on x86_64, you need to have at most 192 bytes of headroom
(192 + 320 of shinfo = 512 per frame / 1024 per page), but XDP requires
256.

And then you have one more problem from the other side, I mean the
shinfo size. It can change anytime, because it's not UAPI or ABI or
whatever and nobody can say "hey, don't touch it, you break my page
split"; at the same time, with page splitting you're not able to
increase MAX_SKB_FRAGS. And for MTU > 1536 this is all worthless, just a
waste of cycles. With 1 page per frame you can have up to 3.5k per
fragment.

You mentioned memory footprint. Do you have any exact numbers to show
this can help significantly? Because I have PP on my home router with
16k-sized pages and 128 MB of RAM, and there's no memory shortage there
:D I realize it doesn't mean anything and I'm mostly joking mentioning
this, but still.

>
> Regards,
> Lorenzo
>
> [0] https://lore.kernel.org/netdev/6298f73f7cc7391c7c4a52a6a89b1ae21488bda1.1682188837.git.lorenzo@kernel.org/
> [1] $ pahole -C skb_shared_info vmlinux.o
> struct skb_shared_info {
> 	__u8 flags;                                   /*  0   1 */
> 	__u8 meta_len;                                /*  1   1 */
> 	__u8 nr_frags;                                /*  2   1 */
> 	__u8 tx_flags;                                /*  3   1 */
> 	unsigned short gso_size;                      /*  4   2 */
> 	unsigned short gso_segs;                      /*  6   2 */
> 	struct sk_buff * frag_list;                   /*  8   8 */
> 	struct skb_shared_hwtstamps hwtstamps;        /* 16   8 */
> 	unsigned int gso_type;                        /* 24   4 */
> 	u32 tskey;                                    /* 28   4 */
> 	atomic_t dataref;                             /* 32   4 */
> 	unsigned int xdp_frags_size;                  /* 36   4 */
> 	void * destructor_arg;                        /* 40   8 */
> 	skb_frag_t frags[17];                         /* 48 272 */
>
> 	/* size: 320, cachelines: 5, members: 14 */
> };
>
>>
>>>
>>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>>> ---
>>>  drivers/net/veth.c | 39 +++++++++++++++++++++++++--------------
>>>  1 file changed, 25 insertions(+), 14 deletions(-)
>> [...]
>>
>> Thanks,
>> Olek

Thanks,
Olek
On 2023/5/15 19:24, Lorenzo Bianconi wrote:
>> On 2023/5/12 21:08, Lorenzo Bianconi wrote:
>>> In order to reduce page_pool memory footprint, rely on
>>> page_pool_dev_alloc_frag routine and reduce buffer size
>>> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
>>
>> Is there any performance improvement beside the memory saving? As it
>> should reduce TLB miss, I wonder if the TLB miss reducing can even
>> out the cost of the extra frag reference count handling for the
>> frag support?
>
> reducing the requested headroom to 192 (from 256) we have a nice improvement in
> the 1500B frame case while it is mostly the same in the case of paged skb
> (e.g. MTU 8000B).
>
>>
>>> for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
>>> (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
>>> Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
>>> supported MTU is now reduced to 36350B.
>>
>> Maybe we don't need to limit the frag size to VETH_PAGE_POOL_FRAG_SIZE,
>> and use different frag size depending on the mtu or packet size?
>>
>> Perhaps the page_pool_dev_alloc_frag() can be improved to return non-frag
>> page if the requested frag size is larger than a specified size too.
>> I will try to implement it if the above idea makes sense.
>
> since there are no significant differences between full page and fragmented page
> implementation if the MTU is over the page boundary, does it worth to do so?
> (at least for the veth use-case).

Yes, as there is no significant difference between the full page and
fragmented page implementations, unifying the interface is trivial; I
have sent an RFC for that. Using that interface may solve the supported
MTU reduction problem, at least.

>
> Regards,
> Lorenzo
>
Cc. Alex Duyck + Eric, please criticize my idea below.

On 12/05/2023 15.08, Lorenzo Bianconi wrote:
> In order to reduce page_pool memory footprint, rely on
> page_pool_dev_alloc_frag routine and reduce buffer size
> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> supported MTU is now reduced to 36350B.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/veth.c | 39 +++++++++++++++++++++++++--------------
>  1 file changed, 25 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 614f3e3efab0..0e648703cccf 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -31,9 +31,12 @@
>  #define DRV_NAME	"veth"
>  #define DRV_VERSION	"1.0"
>
> -#define VETH_XDP_FLAG		BIT(0)
> -#define VETH_RING_SIZE		256
> -#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
> +#define VETH_XDP_FLAG			BIT(0)
> +#define VETH_RING_SIZE			256
> +#define VETH_XDP_PACKET_HEADROOM	192
> +#define VETH_XDP_HEADROOM		(VETH_XDP_PACKET_HEADROOM + \
> +					 NET_IP_ALIGN)
> +#define VETH_PAGE_POOL_FRAG_SIZE	2048
>
>  #define VETH_XDP_TX_BULK_SIZE	16
>  #define VETH_XDP_BATCH		16
> @@ -736,7 +739,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>  	    skb_shinfo(skb)->nr_frags ||
>  	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
> -		u32 size, len, max_head_size, off;
> +		u32 size, len, max_head_size, off, pp_off;
>  		struct sk_buff *nskb;
>  		struct page *page;
>  		int i, head_off;
> @@ -747,17 +750,20 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  		 *
>  		 * Make sure we have enough space for linear and paged area
>  		 */
> -		max_head_size = SKB_WITH_OVERHEAD(PAGE_SIZE -
> +		max_head_size = SKB_WITH_OVERHEAD(VETH_PAGE_POOL_FRAG_SIZE -
>  						  VETH_XDP_HEADROOM);
> -		if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
> +		if (skb->len >
> +		    VETH_PAGE_POOL_FRAG_SIZE * MAX_SKB_FRAGS + max_head_size)
>  			goto drop;
>
>  		/* Allocate skb head */
> -		page = page_pool_dev_alloc_pages(rq->page_pool);

It seems wasteful to allocate a full page PAGE_SIZE.

> +		page = page_pool_dev_alloc_frag(rq->page_pool, &pp_off,
> +						VETH_PAGE_POOL_FRAG_SIZE);

Allocating PAGE_SIZE/2 isn't much better. At this point we already know
the skb->len (and skb_headlen). Why don't we allocate the size that we
need? See the page_frag_alloc() system invented by Eric and Duyck.

>  		if (!page)
>  			goto drop;
>
> -		nskb = napi_build_skb(page_address(page), PAGE_SIZE);
> +		nskb = napi_build_skb(page_address(page) + pp_off,
> +				      VETH_PAGE_POOL_FRAG_SIZE);
>  		if (!nskb) {
>  			page_pool_put_full_page(rq->page_pool, page, true);
>  			goto drop;
>  		}
>
> @@ -782,15 +788,18 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>  		len = skb->len - off;
>
>  		for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
> -			page = page_pool_dev_alloc_pages(rq->page_pool);
> +			page = page_pool_dev_alloc_frag(rq->page_pool, &pp_off,
> +							VETH_PAGE_POOL_FRAG_SIZE);
>  			if (!page) {
>  				consume_skb(nskb);
>  				goto drop;
>  			}
>
> -			size = min_t(u32, len, PAGE_SIZE);
> -			skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
> -			if (skb_copy_bits(skb, off, page_address(page),
> +			size = min_t(u32, len, VETH_PAGE_POOL_FRAG_SIZE);
> +			skb_add_rx_frag(nskb, i, page, pp_off, size,
> +					VETH_PAGE_POOL_FRAG_SIZE);
> +			if (skb_copy_bits(skb, off,
> +					  page_address(page) + pp_off,
>  					  size)) {
> 				consume_skb(nskb);
> 				goto drop;
[...]
> On Mon, May 15, 2023 at 01:24:20PM +0200, Lorenzo Bianconi wrote:
> > > On 2023/5/12 21:08, Lorenzo Bianconi wrote:
> > > > In order to reduce page_pool memory footprint, rely on
> > > > page_pool_dev_alloc_frag routine and reduce buffer size
> > > > (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
> > >
> > > Is there any performance improvement beside the memory saving? As it
> > > should reduce TLB miss, I wonder if the TLB miss reducing can even
> > > out the cost of the extra frag reference count handling for the
> > > frag support?
> >
> > reducing the requested headroom to 192 (from 256) we have a nice improvement in
> > the 1500B frame case while it is mostly the same in the case of paged skb
> > (e.g. MTU 8000B).
>
> Can you define 'nice improvement' ? ;)
> Show us numbers or improvement in %.

I am testing this RFC patch in the scenario reported below:

iperf tcp tx --> veth0 --> veth1 (xdp_pass) --> iperf tcp rx

- 6.4.0-rc1 net-next:
  MTU 1500B: ~ 7.07 Gbps
  MTU 8000B: ~ 14.7 Gbps

- 6.4.0-rc1 net-next + page_pool frag support in veth:
  MTU 1500B: ~ 8.57 Gbps
  MTU 8000B: ~ 14.5 Gbps

side note: it seems there is a regression between 6.2.15 and 6.4.0-rc1
net-next (even without the latest veth page_pool patches) in the
throughput I can get in the scenario above, but I have not looked into
it yet.

- 6.2.15:
  MTU 1500B: ~ 7.91 Gbps
  MTU 8000B: ~ 14.1 Gbps

- 6.4.0-rc1 net-next w/o commits [0],[1],[2]:
  MTU 1500B: ~ 6.38 Gbps
  MTU 8000B: ~ 13.2 Gbps

Regards,
Lorenzo

[0] 0ebab78cbcbf net: veth: add page_pool for page recycling
[1] 4fc418053ec7 net: veth: add page_pool stats
[2] 9d142ed484a3 net: veth: rely on napi_build_skb in veth_convert_skb_to_xdp_buff

> > > > for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
> > > > (XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
> > > > Please note, using default values (CONFIG_MAX_SKB_FRAGS=17), maximum
> > > > supported MTU is now reduced to 36350B.
> > >
> > > Maybe we don't need to limit the frag size to VETH_PAGE_POOL_FRAG_SIZE,
> > > and use different frag size depending on the mtu or packet size?
> > >
> > > Perhaps the page_pool_dev_alloc_frag() can be improved to return non-frag
> > > page if the requested frag size is larger than a specified size too.
> > > I will try to implement it if the above idea makes sense.
> >
> > since there are no significant differences between full page and fragmented page
> > implementation if the MTU is over the page boundary, does it worth to do so?
> > (at least for the veth use-case).
> >
> > Regards,
> > Lorenzo
On 2023/5/17 6:52, Lorenzo Bianconi wrote:
>> On Mon, May 15, 2023 at 01:24:20PM +0200, Lorenzo Bianconi wrote:
>>>> On 2023/5/12 21:08, Lorenzo Bianconi wrote:
>>>>> In order to reduce page_pool memory footprint, rely on
>>>>> page_pool_dev_alloc_frag routine and reduce buffer size
>>>>> (VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
>>>>
>>>> Is there any performance improvement beside the memory saving? As it
>>>> should reduce TLB miss, I wonder if the TLB miss reducing can even
>>>> out the cost of the extra frag reference count handling for the
>>>> frag support?
>>>
>>> reducing the requested headroom to 192 (from 256) we have a nice improvement in
>>> the 1500B frame case while it is mostly the same in the case of paged skb
>>> (e.g. MTU 8000B).
>>
>> Can you define 'nice improvement' ? ;)
>> Show us numbers or improvement in %.
>
> I am testing this RFC patch in the scenario reported below:
>
> iperf tcp tx --> veth0 --> veth1 (xdp_pass) --> iperf tcp rx
>
> - 6.4.0-rc1 net-next:
>   MTU 1500B: ~ 7.07 Gbps
>   MTU 8000B: ~ 14.7 Gbps
>
> - 6.4.0-rc1 net-next + page_pool frag support in veth:
>   MTU 1500B: ~ 8.57 Gbps
>   MTU 8000B: ~ 14.5 Gbps

Thanks for sharing the data.
Maybe using the new frag interface introduced in [1] brings back the
performance for the MTU 8000B case.

1. https://patchwork.kernel.org/project/netdevbpf/cover/20230516124801.2465-1-linyunsheng@huawei.com/

I drafted a patch for veth to use the new frag interface; maybe that
will show how veth can make use of it. Would you give it a try to see
if there is any performance improvement for the MTU 8000B case? Thanks.
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -737,8 +737,8 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 	    skb_shinfo(skb)->nr_frags ||
 	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
 		u32 size, len, max_head_size, off;
+		struct page_pool_frag *pp_frag;
 		struct sk_buff *nskb;
-		struct page *page;
 		int i, head_off;
 
 		/* We need a private copy of the skb and data buffers since
@@ -752,14 +752,20 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 		if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
 			goto drop;
 
+		size = min_t(u32, skb->len, max_head_size);
+		size += VETH_XDP_HEADROOM;
+		size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
 		/* Allocate skb head */
-		page = page_pool_dev_alloc_pages(rq->page_pool);
-		if (!page)
+		pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
+		if (!pp_frag)
 			goto drop;
 
-		nskb = napi_build_skb(page_address(page), PAGE_SIZE);
+		nskb = napi_build_skb(page_address(pp_frag->page) + pp_frag->offset,
+				      pp_frag->truesize);
 		if (!nskb) {
-			page_pool_put_full_page(rq->page_pool, page, true);
+			page_pool_put_full_page(rq->page_pool, pp_frag->page,
+						true);
 			goto drop;
 		}
 
@@ -782,16 +788,18 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 		len = skb->len - off;
 
 		for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
-			page = page_pool_dev_alloc_pages(rq->page_pool);
-			if (!page) {
+			size = min_t(u32, len, PAGE_SIZE);
+
+			pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
+			if (!pp_frag) {
 				consume_skb(nskb);
 				goto drop;
 			}
 
-			size = min_t(u32, len, PAGE_SIZE);
-			skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
-			if (skb_copy_bits(skb, off, page_address(page),
-					  size)) {
+			skb_add_rx_frag(nskb, i, pp_frag->page, pp_frag->offset,
+					size, pp_frag->truesize);
+			if (skb_copy_bits(skb, off, page_address(pp_frag->page) +
+					  pp_frag->offset, size)) {
 				consume_skb(nskb);
 				goto drop;
 			}
@@ -1047,6 +1055,8 @@ static int veth_create_page_pool(struct veth_rq *rq)
 		return err;
 	}
 
+	page_pool_set_max_frag_size(rq->page_pool, PAGE_SIZE / 2);
+
 	return 0;
 }
> On 2023/5/17 6:52, Lorenzo Bianconi wrote:
> > I am testing this RFC patch in the scenario reported below:
> >
> > iperf tcp tx --> veth0 --> veth1 (xdp_pass) --> iperf tcp rx
> >
> > - 6.4.0-rc1 net-next:
> >   MTU 1500B: ~ 7.07 Gbps
> >   MTU 8000B: ~ 14.7 Gbps
> >
> > - 6.4.0-rc1 net-next + page_pool frag support in veth:
> >   MTU 1500B: ~ 8.57 Gbps
> >   MTU 8000B: ~ 14.5 Gbps
>
> Thanks for sharing the data.
> Maybe using the new frag interface introduced in [1] brings back the
> performance for the MTU 8000B case.
>
> 1. https://patchwork.kernel.org/project/netdevbpf/cover/20230516124801.2465-1-linyunsheng@huawei.com/
>
> I drafted a patch for veth to use the new frag interface; maybe that
> will show how veth can make use of it. Would you give it a try to see
> if there is any performance improvement for the MTU 8000B case? Thanks.
>
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
[... draft patch trimmed ...]
> @@ -1047,6 +1055,8 @@ static int veth_create_page_pool(struct veth_rq *rq)
>  		return err;
>  	}
>
> +	page_pool_set_max_frag_size(rq->page_pool, PAGE_SIZE / 2);
> +
>  	return 0;
>  }

IIUC, the code here uses a variable length for the linear part (at most
one page) while we will always use a full page (except for the last
fragment) for the paged area, correct? I have not tested it yet, but I
do not think we will get a significant improvement, since if we set the
MTU to 8000B in my tests we get mostly the same throughput (14.5 Gbps
vs 14.7 Gbps) whether we use page_pool fragments or full pages. Am I
missing something?

What we are discussing with Jesper is to try to allocate an order 3
page from the pool and rely on page_pool fragments, similar to what
page_frag_cache is doing. I will look into it if there are no strong
'red flags'.

Regards,
Lorenzo
On Wed, 17 May 2023 00:52:25 +0200 Lorenzo Bianconi wrote:
> I am testing this RFC patch in the scenario reported below:
>
> iperf tcp tx --> veth0 --> veth1 (xdp_pass) --> iperf tcp rx
>
> - 6.4.0-rc1 net-next:
>   MTU 1500B: ~ 7.07 Gbps
>   MTU 8000B: ~ 14.7 Gbps
>
> - 6.4.0-rc1 net-next + page_pool frag support in veth:
>   MTU 1500B: ~ 8.57 Gbps
>   MTU 8000B: ~ 14.5 Gbps
>
> side note: it seems there is a regression between 6.2.15 and 6.4.0-rc1 net-next
> (even without latest veth page_pool patches) in the throughput I can get in the
> scenario above, but I have not looked into it yet.
>
> - 6.2.15:
>   MTU 1500B: ~ 7.91 Gbps
>   MTU 8000B: ~ 14.1 Gbps
>
> - 6.4.0-rc1 net-next w/o commits [0],[1],[2]
>   MTU 1500B: ~ 6.38 Gbps
>   MTU 8000B: ~ 13.2 Gbps

If the benchmark is iperf, wouldn't working towards preserving the GSO
status across XDP (assuming the prog is multi-buf-capable) be the most
beneficial optimization?
On 2023/5/17 22:17, Lorenzo Bianconi wrote:
>> Maybe using the new frag interface introduced in [1] brings
>> back the performance for the MTU 8000B case.
>>
>> 1. https://patchwork.kernel.org/project/netdevbpf/cover/20230516124801.2465-1-linyunsheng@huawei.com/
>>
>> I drafted a patch for veth to use the new frag interface; maybe that
>> will show how veth can make use of it. Would you give it a try to see
>> if there is any performance improvement for the MTU 8000B case? Thanks.
>>
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -737,8 +737,8 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>  	    skb_shinfo(skb)->nr_frags ||
>>  	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
>>  		u32 size, len, max_head_size, off;
>> +		struct page_pool_frag *pp_frag;
>>  		struct sk_buff *nskb;
>> -		struct page *page;
>>  		int i, head_off;
>>
>>  		/* We need a private copy of the skb and data buffers since
>> @@ -752,14 +752,20 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>  		if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
>>  			goto drop;
>>
>> +		size = min_t(u32, skb->len, max_head_size);
>> +		size += VETH_XDP_HEADROOM;
>> +		size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +
>>  		/* Allocate skb head */
>> -		page = page_pool_dev_alloc_pages(rq->page_pool);
>> -		if (!page)
>> +		pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
>> +		if (!pp_frag)
>>  			goto drop;
>>
>> -		nskb = napi_build_skb(page_address(page), PAGE_SIZE);
>> +		nskb = napi_build_skb(page_address(pp_frag->page) + pp_frag->offset,
>> +				      pp_frag->truesize);
>>  		if (!nskb) {
>> -			page_pool_put_full_page(rq->page_pool, page, true);
>> +			page_pool_put_full_page(rq->page_pool, pp_frag->page,
>> +						true);
>>  			goto drop;
>>  		}
>>
>> @@ -782,16 +788,18 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
>>  		len = skb->len - off;
>>
>>  		for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
>> -			page = page_pool_dev_alloc_pages(rq->page_pool);
>> -			if (!page) {
>> +			size = min_t(u32, len, PAGE_SIZE);
>> +
>> +			pp_frag = page_pool_dev_alloc_frag(rq->page_pool, size);
>> +			if (!pp_frag) {
>>  				consume_skb(nskb);
>>  				goto drop;
>>  			}
>>
>> -			size = min_t(u32, len, PAGE_SIZE);
>> -			skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
>> -			if (skb_copy_bits(skb, off, page_address(page),
>> -					  size)) {
>> +			skb_add_rx_frag(nskb, i, pp_frag->page, pp_frag->offset,
>> +					size, pp_frag->truesize);
>> +			if (skb_copy_bits(skb, off, page_address(pp_frag->page) +
>> +					  pp_frag->offset, size)) {
>>  				consume_skb(nskb);
>>  				goto drop;
>>  			}
>> @@ -1047,6 +1055,8 @@ static int veth_create_page_pool(struct veth_rq *rq)
>>  		return err;
>>  	}
>
> IIUC, the code here uses a variable length for the linear part (at most
> one page) while we will always use a full page (except for the last
> fragment) for the paged area, correct?

More precisely, it does not care whether the data is in the linear part or
in the paged area. We copy the data to the new skb using the fewest
possible fragments and the least memory, depending on the head/tail room
size and the page size/order, since skb_copy_bits() hides the data layout
difference from its caller.

> I have not tested it yet but I do not think we will get a significant
> improvement since, if we set the MTU to 8000B, in my tests we get mostly
> the same throughput (14.5 Gbps vs 14.7 Gbps) whether we use page_pool
> fragments or full pages. Am I missing something?

I don't expect a significant improvement either, but I do expect a 'nice
improvement' for performance and memory saving, depending on how you view
'nice improvement' :)

> What we are discussing with Jesper is to try to allocate an order 3 page
> from the pool and rely on page_pool fragments, similar to what
> page_frag_cache is doing. I will look into it if there are no strong
> 'red flags'. Thanks.

Yes, if we do not really care about memory usage, using an order 3 page
should give more performance improvement. As I understand it, the
improvement mentioned above also applies to order 3 pages.

> Regards,
> Lorenzo
>
>> +	page_pool_set_max_frag_size(rq->page_pool, PAGE_SIZE / 2);
>> +
>>  	return 0;
>>  }
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..0e648703cccf 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -31,9 +31,12 @@
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
-#define VETH_XDP_FLAG		BIT(0)
-#define VETH_RING_SIZE		256
-#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+#define VETH_XDP_FLAG			BIT(0)
+#define VETH_RING_SIZE			256
+#define VETH_XDP_PACKET_HEADROOM	192
+#define VETH_XDP_HEADROOM		(VETH_XDP_PACKET_HEADROOM + \
+					 NET_IP_ALIGN)
+#define VETH_PAGE_POOL_FRAG_SIZE	2048
 
 #define VETH_XDP_TX_BULK_SIZE	16
 #define VETH_XDP_BATCH		16
@@ -736,7 +739,7 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 	if (skb_shared(skb) || skb_head_is_locked(skb) ||
 	    skb_shinfo(skb)->nr_frags ||
 	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
-		u32 size, len, max_head_size, off;
+		u32 size, len, max_head_size, off, pp_off;
 		struct sk_buff *nskb;
 		struct page *page;
 		int i, head_off;
@@ -747,17 +750,20 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 		 *
 		 * Make sure we have enough space for linear and paged area
 		 */
-		max_head_size = SKB_WITH_OVERHEAD(PAGE_SIZE -
+		max_head_size = SKB_WITH_OVERHEAD(VETH_PAGE_POOL_FRAG_SIZE -
 						  VETH_XDP_HEADROOM);
-		if (skb->len > PAGE_SIZE * MAX_SKB_FRAGS + max_head_size)
+		if (skb->len >
+		    VETH_PAGE_POOL_FRAG_SIZE * MAX_SKB_FRAGS + max_head_size)
 			goto drop;
 
 		/* Allocate skb head */
-		page = page_pool_dev_alloc_pages(rq->page_pool);
+		page = page_pool_dev_alloc_frag(rq->page_pool, &pp_off,
+						VETH_PAGE_POOL_FRAG_SIZE);
 		if (!page)
 			goto drop;
 
-		nskb = napi_build_skb(page_address(page), PAGE_SIZE);
+		nskb = napi_build_skb(page_address(page) + pp_off,
+				      VETH_PAGE_POOL_FRAG_SIZE);
 		if (!nskb) {
 			page_pool_put_full_page(rq->page_pool, page, true);
 			goto drop;
@@ -782,15 +788,18 @@ static int veth_convert_skb_to_xdp_buff(struct veth_rq *rq,
 		len = skb->len - off;
 
 		for (i = 0; i < MAX_SKB_FRAGS && off < skb->len; i++) {
-			page = page_pool_dev_alloc_pages(rq->page_pool);
+			page = page_pool_dev_alloc_frag(rq->page_pool, &pp_off,
+							VETH_PAGE_POOL_FRAG_SIZE);
 			if (!page) {
 				consume_skb(nskb);
 				goto drop;
 			}
 
-			size = min_t(u32, len, PAGE_SIZE);
-			skb_add_rx_frag(nskb, i, page, 0, size, PAGE_SIZE);
-			if (skb_copy_bits(skb, off, page_address(page),
+			size = min_t(u32, len, VETH_PAGE_POOL_FRAG_SIZE);
+			skb_add_rx_frag(nskb, i, page, pp_off, size,
+					VETH_PAGE_POOL_FRAG_SIZE);
+			if (skb_copy_bits(skb, off,
+					  page_address(page) + pp_off,
 					  size)) {
 				consume_skb(nskb);
 				goto drop;
@@ -1035,6 +1044,7 @@ static int veth_create_page_pool(struct veth_rq *rq)
 	struct page_pool_params pp_params = {
 		.order = 0,
 		.pool_size = VETH_RING_SIZE,
+		.flags = PP_FLAG_PAGE_FRAG,
 		.nid = NUMA_NO_NODE,
 		.dev = &rq->dev->dev,
 	};
@@ -1634,13 +1644,14 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			goto err;
 		}
 
-		max_mtu = SKB_WITH_OVERHEAD(PAGE_SIZE - VETH_XDP_HEADROOM) -
+		max_mtu = SKB_WITH_OVERHEAD(VETH_PAGE_POOL_FRAG_SIZE -
+					    VETH_XDP_HEADROOM) -
 			  peer->hard_header_len;
 		/* Allow increasing the max_mtu if the program supports
 		 * XDP fragments.
 		 */
 		if (prog->aux->xdp_has_frags)
-			max_mtu += PAGE_SIZE * MAX_SKB_FRAGS;
+			max_mtu += VETH_PAGE_POOL_FRAG_SIZE * MAX_SKB_FRAGS;
 
 		if (peer->mtu > max_mtu) {
 			NL_SET_ERR_MSG_MOD(extack,
 					   "Peer MTU is too large to set XDP");
In order to reduce page_pool memory footprint, rely on the
page_pool_dev_alloc_frag routine and reduce the buffer size
(VETH_PAGE_POOL_FRAG_SIZE) to PAGE_SIZE / 2 in order to consume one page
for two 1500B frames. Reduce VETH_XDP_PACKET_HEADROOM to 192 from 256
(XDP_PACKET_HEADROOM) to fit max_head_size in VETH_PAGE_POOL_FRAG_SIZE.
Please note that, using default values (CONFIG_MAX_SKB_FRAGS=17), the
maximum supported MTU is now reduced to 36350B.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/veth.c | 39 +++++++++++++++++++++++++--------------
 1 file changed, 25 insertions(+), 14 deletions(-)