
[bpf] xsk: mark napi_id on sendmsg()

Message ID 20220629105752.933839-1-maciej.fijalkowski@intel.com (mailing list archive)
State Superseded
Delegated to: BPF
Series [bpf] xsk: mark napi_id on sendmsg()

Checks

Context Check Description
bpf/vmtest-bpf-VM_Test-2 pending Logs for Kernel LATEST on ubuntu-latest with llvm-15
netdev/tree_selection success Clearly marked for bpf
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 7 maintainers not CCed: hawk@kernel.org pabeni@redhat.com jonathan.lemon@gmail.com davem@davemloft.net edumazet@google.com kuba@kernel.org john.fastabend@gmail.com
netdev/build_clang success Errors and warnings before: 6 this patch: 6
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch warning WARNING: line length of 88 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-VM_Test-3 success Logs for Kernel LATEST on z15 with gcc
bpf/vmtest-bpf-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-PR fail PR summary

Commit Message

Fijalkowski, Maciej June 29, 2022, 10:57 a.m. UTC
When an application runs in zero-copy busy-poll mode and only sends
packets without receiving a single one, it is currently impossible to get
into napi_busy_loop(), as napi_id is only marked on the Rx side, in
xsk_rcv_check(). There, napi_id is taken from the xdp_rxq_info carried by
the xdp_buff. From the Tx perspective we do not have access to it; what we
do have handy is the xsk pool.

The xsk pool works on a pool of internal xdp_buff wrappers called
xdp_buff_xsk. AF_XDP ZC-enabled drivers call xp_set_rxq_info(), so each
xdp_buff_xsk has a valid pointer to the xdp_rxq_info of the underlying
queue. Therefore, on the Tx side, napi_id can be pulled from
xs->pool->heads[0].xdp.rxq->napi_id.

Do this only for sockets working in ZC mode, as otherwise the rxq
pointers would not be initialized.

Fixes: a0731952d9cd ("xsk: Add busy-poll support for {recv,send}msg()")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Björn Töpel June 29, 2022, 12:45 p.m. UTC | #1
On Wed, 29 Jun 2022 at 12:58, Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> When application runs in zero copy busy poll mode and does not receive a
> single packet but only sends them, it is currently impossible to get
> into napi_busy_loop() as napi_id is only marked on Rx side in
> xsk_rcv_check(). In there, napi_id is being taken from xdp_rxq_info
> carried by xdp_buff. From Tx perspective, we do not have access to it.
> What we have handy is the xsk pool.

The fact that the napi_id is not set unless set from the ingress side
is actually "by design". It's CONFIG_NET_RX_BUSY_POLL after all. I
followed the semantics of the regular busy-polling sockets. So, I
wouldn't say it's a fix! The busy-polling in sendmsg is really just
about "driving the RX busy-polling from another socket syscall".

That being said, I definitely see that this is useful for AF_XDP
sockets, but keep in mind that it sort of changes the behavior from
regular sockets. And we'll get different behavior for
copy-mode/zero-copy mode.

TL;DR, I think it's a good addition. One small nit below:

> +                       __sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);

Please hide this hideous pointer chasing in something neater:
xsk_pool_get_napi_id() or something.
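A minimal sketch of what such a helper could look like, assuming the pool
layout used in the patch; the name xsk_pool_get_napi_id() is only the
suggestion above and does not exist in the tree:

static inline unsigned int xsk_pool_get_napi_id(struct xsk_buff_pool *pool)
{
	/* every xdp_buff_xsk in the pool shares the same rxq, so
	 * heads[0] is representative for the whole pool
	 */
	return pool->heads[0].xdp.rxq->napi_id;
}

The sendmsg() call site would then become
__sk_mark_napi_id_once(sk, xsk_pool_get_napi_id(xs->pool)).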


Björn
Fijalkowski, Maciej June 29, 2022, 12:53 p.m. UTC | #2
On Wed, Jun 29, 2022 at 02:45:11PM +0200, Björn Töpel wrote:
> On Wed, 29 Jun 2022 at 12:58, Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > When application runs in zero copy busy poll mode and does not receive a
> > single packet but only sends them, it is currently impossible to get
> > into napi_busy_loop() as napi_id is only marked on Rx side in
> > xsk_rcv_check(). In there, napi_id is being taken from xdp_rxq_info
> > carried by xdp_buff. From Tx perspective, we do not have access to it.
> > What we have handy is the xsk pool.
> 
> The fact that the napi_id is not set unless set from the ingress side
> is actually "by design". It's CONFIG_NET_RX_BUSY_POLL after all. I
> followed the semantics of the regular busy-polling sockets. So, I
> wouldn't say it's a fix! The busy-polling in sendmsg is really just
> about "driving the RX busy-polling from another socket syscall".

I just felt that busy polling for txonly apps was broken, hence the
'fixing' flavour. I can send it just as an improvement to bpf-next.

> 
> That being said, I definitely see that this is useful for AF_XDP
> sockets, but keep in mind that it sort of changes the behavior from
> regular sockets. And we'll get different behavior for
> copy-mode/zero-copy mode.
> 
> TL;DR, I think it's a good addition. One small nit below:
> 
> > +                       __sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);
> 
> Please hide this hideous pointer chasing in something neater:
> xsk_pool_get_napi_id() or something.

Would it make sense to introduce napi_id to xsk_buff_pool then?
xp_set_rxq_info() could be setting it. We are sure that napi_id is the
same for whole pool (each xdp_buff_xsk's rxq info).

> 
> 
> Björn
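A minimal sketch of the xp_set_rxq_info() idea above, assuming napi_id
becomes a new field of struct xsk_buff_pool (the loop is how the current
setter assigns the rxq pointers, quoted from memory, not verbatim):

void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq)
{
	u32 i;

	for (i = 0; i < pool->heads_cnt; i++)
		pool->heads[i].xdp.rxq = rxq;

	/* hypothetical addition: all heads share one rxq_info, so a
	 * single napi_id per pool is enough
	 */
	pool->napi_id = rxq->napi_id;
}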
Magnus Karlsson June 29, 2022, 1:18 p.m. UTC | #3
On Wed, Jun 29, 2022 at 2:58 PM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Jun 29, 2022 at 02:45:11PM +0200, Björn Töpel wrote:
> > On Wed, 29 Jun 2022 at 12:58, Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > When application runs in zero copy busy poll mode and does not receive a
> > > single packet but only sends them, it is currently impossible to get
> > > into napi_busy_loop() as napi_id is only marked on Rx side in
> > > xsk_rcv_check(). In there, napi_id is being taken from xdp_rxq_info
> > > carried by xdp_buff. From Tx perspective, we do not have access to it.
> > > What we have handy is the xsk pool.
> >
> > The fact that the napi_id is not set unless set from the ingress side
> > is actually "by design". It's CONFIG_NET_RX_BUSY_POLL after all. I
> > followed the semantics of the regular busy-polling sockets. So, I
> > wouldn't say it's a fix! The busy-polling in sendmsg is really just
> > about "driving the RX busy-polling from another socket syscall".
>
> I just felt that busy polling for txonly apps was broken, hence the
> 'fixing' flavour. I can send it just as an improvement to bpf-next.
>
> >
> > That being said, I definitely see that this is useful for AF_XDP
> > sockets, but keep in mind that it sort of changes the behavior from
> > regular sockets. And we'll get different behavior for
> > copy-mode/zero-copy mode.
> >
> > TL;DR, I think it's a good addition. One small nit below:
> >
> > > +                       __sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);
> >
> > Please hide this hideous pointer chasing in something neater:
> > xsk_pool_get_napi_id() or something.
>
> Would it make sense to introduce napi_id to xsk_buff_pool then?

Only if it has a positive performance impact. Let us not carry copies
of state otherwise. So please measure first and see if it makes any
difference. If not, I prefer the pointer chasing hidden behind an API
like Björn suggests.

> xp_set_rxq_info() could be setting it. We are sure that napi_id is the
> same for whole pool (each xdp_buff_xsk's rxq info).
>
> >
> >
> > Björn
Björn Töpel June 29, 2022, 1:39 p.m. UTC | #4
On Wed, 29 Jun 2022 at 14:45, Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> TL;DR, I think it's a good addition. One small nit below:
>

I forgot one thing. Setting napi_id should be gated by
"CONFIG_NET_RX_BUSY_POLL" ifdefs.
Fijalkowski, Maciej June 29, 2022, 2:28 p.m. UTC | #5
On Wed, Jun 29, 2022 at 03:39:57PM +0200, Björn Töpel wrote:
> On Wed, 29 Jun 2022 at 14:45, Björn Töpel <bjorn.topel@gmail.com> wrote:
> >
> > TL;DR, I think it's a good addition. One small nit below:
> >
> 
> I forgot one thing. Setting napi_id should be gated by
> "CONFIG_NET_RX_BUSY_POLL" ifdefs.

The napi_id marking functions already have this ifdef wrapped inside them,
so no extra gating is needed at the call site.
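For reference, the marking helper in include/net/busy_poll.h has roughly
this shape (paraphrased from memory, not a verbatim copy), so it compiles
away to nothing when CONFIG_NET_RX_BUSY_POLL is off:

static inline void __sk_mark_napi_id_once(struct sock *sk, unsigned int napi_id)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	/* record only the first napi_id seen for this socket */
	if (!READ_ONCE(sk->sk_napi_id))
		WRITE_ONCE(sk->sk_napi_id, napi_id);
#endif
}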
Jakub Kicinski June 29, 2022, 4:16 p.m. UTC | #6
On Wed, 29 Jun 2022 14:53:34 +0200 Maciej Fijalkowski wrote:
> > > +                       __sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);  
> > 
> > Please hide this hideous pointer chasing in something neater:
> > xsk_pool_get_napi_id() or something.  
> 
> Would it make sense to introduce napi_id to xsk_buff_pool then?
> xp_set_rxq_info() could be setting it. We are sure that napi_id is the
> same for whole pool (each xdp_buff_xsk's rxq info).

Would it be possible to move the marking to when the queue is getting
bound instead of the recv/send paths?
Jakub Kicinski June 29, 2022, 4:17 p.m. UTC | #7
On Wed, 29 Jun 2022 09:16:29 -0700 Jakub Kicinski wrote:
> > Would it make sense to introduce napi_id to xsk_buff_pool then?
> > xp_set_rxq_info() could be setting it. We are sure that napi_id is the
> > same for whole pool (each xdp_buff_xsk's rxq info).  
> 
> Would it be possible to move the marking to when the queue is getting
> bound instead of the recv/send paths? 

I mean when socket is getting bound.
Fijalkowski, Maciej June 30, 2022, 11:53 a.m. UTC | #8
On Wed, Jun 29, 2022 at 05:17:07PM +0100, Jakub Kicinski wrote:
> On Wed, 29 Jun 2022 09:16:29 -0700 Jakub Kicinski wrote:
> > > Would it make sense to introduce napi_id to xsk_buff_pool then?
> > > xp_set_rxq_info() could be setting it. We are sure that napi_id is the
> > > same for whole pool (each xdp_buff_xsk's rxq info).  
> > 
> > Would it be possible to move the marking to when the queue is getting
> > bound instead of the recv/send paths? 
> 
> I mean when socket is getting bound.

So Bjorn said that it was the design choice to follow the standard
sockets' approach. I'm including a dirty diff for a discussion which
allows me to get napi_id at bind() time. But, this works for ice as this
driver during the XDP prog/XSK pool load only disables the NAPI, so we are
sure that napi_id stays the same. That might not be the case for other
AF_XDP ZC enabled drivers though, they might delete the NAPI and this
approach wouldn't work...or am I missing something?

I'd prefer the diff below though as it simplifies the data path, but I
can't say if it's safe to do so. We would have to be sure about drivers
keeping their NAPI struct. This would also allow us to drop napi_id from
xdp_rxq_info.

Thoughts?

diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 49ba8bfdbf04..3d084558628e 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -312,6 +312,7 @@ ice_xsk_pool_enable(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 		return err;
 
 	set_bit(qid, vsi->af_xdp_zc_qps);
+	xsk_pool_set_napi_id(pool, vsi->rx_rings[qid]->q_vector->napi.napi_id);
 
 	return 0;
 }
@@ -348,7 +349,6 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 
 	pool_failure = pool_present ? ice_xsk_pool_enable(vsi, pool, qid) :
 				      ice_xsk_pool_disable(vsi, qid);
-
 xsk_pool_if_up:
 	if (if_running) {
 		ret = ice_qp_ena(vsi, qid);
@@ -358,6 +358,7 @@ int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid)
 			netdev_err(vsi->netdev, "ice_qp_ena error = %d\n", ret);
 	}
 
+
 failure:
 	if (pool_failure) {
 		netdev_err(vsi->netdev, "Could not %sable buffer pool, error = %d\n",
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4aa031849668..1a20320cd556 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -44,6 +44,14 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
 	xp_set_rxq_info(pool, rxq);
 }
 
+static inline void xsk_pool_set_napi_id(struct xsk_buff_pool *pool,
+					unsigned int napi_id)
+{
+#ifdef CONFIG_NET_RX_BUSY_POLL
+	xp_set_napi_id(pool, napi_id);
+#endif
+}
+
 static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool,
 				      unsigned long attrs)
 {
@@ -198,6 +206,11 @@ static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool,
 {
 }
 
+static inline void xsk_pool_set_napi_id(struct xsk_buff_pool *pool,
+					unsigned int napi_id)
+{
+}
+
 static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool,
 				      unsigned long attrs)
 {
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 647722e847b4..60775a8d1bcb 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -70,6 +70,7 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	unsigned int napi_id;
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool dma_need_sync;
@@ -125,6 +126,7 @@ static inline void xp_init_xskb_dma(struct xdp_buff_xsk *xskb, struct xsk_buff_p
 
 /* AF_XDP ZC drivers, via xdp_sock_buff.h */
 void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq);
+void xp_set_napi_id(struct xsk_buff_pool *pool, unsigned int napi_id);
 int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
 	       unsigned long attrs, struct page **pages, u32 nr_pages);
 void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index eafd512d38b1..18ac3f32a48d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -222,7 +222,6 @@ static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp)
 	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
 		return -EINVAL;
 
-	sk_mark_napi_id_once_xdp(&xs->sk, xdp);
 	return 0;
 }
 
@@ -637,11 +636,8 @@ static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len
 	if (unlikely(need_wait))
 		return -EOPNOTSUPP;
 
-	if (sk_can_busy_loop(sk)) {
-		if (xs->zc)
-			__sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);
+	if (sk_can_busy_loop(sk))
 		sk_busy_loop(sk, 1); /* only support non-blocking sockets */
-	}
 
 	if (xs->zc && xsk_no_wakeup(sk))
 		return 0;
@@ -1015,6 +1011,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->dev = dev;
 	xs->zc = xs->umem->zc;
 	xs->queue_id = qid;
+	if (xs->zc)
+		xs->sk.sk_napi_id = xs->pool->napi_id;
 	xp_add_xsk(xs->pool, xs);
 
 out_unlock:
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 87bdd71c7bb6..5ec6443be3fc 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -121,6 +121,14 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq)
 }
 EXPORT_SYMBOL(xp_set_rxq_info);
 
+#ifdef CONFIG_NET_RX_BUSY_POLL
+void xp_set_napi_id(struct xsk_buff_pool *pool, unsigned int napi_id)
+{
+	pool->napi_id = napi_id;
+}
+EXPORT_SYMBOL(xp_set_napi_id);
+#endif
+
 static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 {
 	struct netdev_bpf bpf;
Jakub Kicinski June 30, 2022, 3:52 p.m. UTC | #9
On Thu, 30 Jun 2022 13:53:11 +0200 Maciej Fijalkowski wrote:
> > > Would it be possible to move the marking to when the queue is getting
> > > bound instead of the recv/send paths?   
> > 
> > I mean when socket is getting bound.  
> 
> So Bjorn said that it was the design choice to follow the standard
> sockets' approach. I'm including a dirty diff for a discussion which
> allows me to get napi_id at bind() time. But, this works for ice as this
> driver during the XDP prog/XSK pool load only disables the NAPI, so we are
> sure that napi_id stays the same. That might not be the case for other
> AF_XDP ZC enabled drivers though, they might delete the NAPI and this
> approach wouldn't work...or am I missing something?

Possible, but IDK if we're changing anything substantially in that
regard. The existing code already uses sk_mark_napi_id_once() so
if the NAPI ID changes during the lifetime of the socket the user
will be out of luck already. But no strong feelings.

> I'd prefer the diff below though as it simplifies the data path, but I
> can't say if it's safe to do so. We would have to be sure about drivers
> keeping their NAPI struct. This would also allow us to drop napi_id from
> xdp_rxq_info.

Patch

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 19ac872a6624..eafd512d38b1 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -637,8 +637,11 @@  static int __xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len
 	if (unlikely(need_wait))
 		return -EOPNOTSUPP;
 
-	if (sk_can_busy_loop(sk))
+	if (sk_can_busy_loop(sk)) {
+		if (xs->zc)
+			__sk_mark_napi_id_once(sk, xs->pool->heads[0].xdp.rxq->napi_id);
 		sk_busy_loop(sk, 1); /* only support non-blocking sockets */
+	}
 
 	if (xs->zc && xsk_no_wakeup(sk))
 		return 0;