
[net] veth: Avoid drop packets when xdp_redirect performs

Message ID 1664267413-75518-1-git-send-email-hengqi@linux.alibaba.com (mailing list archive)
State Rejected
Delegated to: Netdev Maintainers
Series [net] veth: Avoid drop packets when xdp_redirect performs

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers warning 5 maintainers not CCed: john.fastabend@gmail.com ast@kernel.org hawk@kernel.org bpf@vger.kernel.org daniel@iogearbox.net
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 95 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Heng Qi Sept. 27, 2022, 8:30 a.m. UTC
In the current processing logic, when xdp_redirect occurs, the xdp frame is
transmitted via NAPI.

If the peer veth's NAPI is not ready, the veth will drop the packets.
This doesn't meet our expectations.

In this context, if napi is not ready, we convert the xdp frame to a skb,
and then use veth_xmit() to deliver it to the peer veth.

Like the following case:
Even if veth1's napi cannot be used, the packet redirected from the NIC
will be transmitted to veth1 successfully:

NIC   ->   veth0----veth1
 |                   |
(XDP)             (no XDP)

Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

Comments

Toke Høiland-Jørgensen Sept. 27, 2022, 12:20 p.m. UTC | #1
Heng Qi <hengqi@linux.alibaba.com> writes:

> In the current processing logic, when xdp_redirect occurs, it transmits
> the xdp frame based on napi.
>
> If napi of the peer veth is not ready, the veth will drop the packets.
> This doesn't meet our expectations.

Erm, why don't you just enable NAPI? Loading an XDP program is not
needed these days, you can just enable GRO on both peers...
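Something like this is enough (the interface names veth0/veth1 are examples):

```shell
# Enabling GRO on both peers switches the veth pair into NAPI mode,
# so veth_xdp_xmit() finds rq->napi set and redirected frames are
# accepted -- no XDP program needed on either peer.
ethtool -K veth0 gro on
ethtool -K veth1 gro on
```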

> In this context, if napi is not ready, we convert the xdp frame to a skb,
> and then use veth_xmit() to deliver it to the peer veth.
>
> Like the following case:
> Even if veth1's napi cannot be used, the packet redirected from the NIC
> will be transmitted to veth1 successfully:
>
> NIC   ->   veth0----veth1
>  |                   |
> (XDP)             (no XDP)
>
> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 466da01..e1f5561 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>  	/* The napi pointer is set if NAPI is enabled, which ensures that
>  	 * xdp_ring is initialized on receive side and the peer device is up.
>  	 */
> -	if (!rcu_access_pointer(rq->napi))
> +	if (!rcu_access_pointer(rq->napi)) {
> +		for (i = 0; i < n; i++) {
> +			struct xdp_frame *xdpf = frames[i];
> +			struct netdev_queue *txq = NULL;
> +			struct sk_buff *skb;
> +			int queue_mapping;
> +			u16 mac_len;
> +
> +			skb = xdp_build_skb_from_frame(xdpf, dev);
> +			if (unlikely(!skb)) {
> +				ret = nxmit;
> +				goto out;
> +			}
> +
> +			/* We need to restore ETH header, because it is pulled
> +			 * in eth_type_trans.
> +			 */
> +			mac_len = skb->data - skb_mac_header(skb);
> +			skb_push(skb, mac_len);
> +
> +			nxmit++;
> +
> +			queue_mapping = skb_get_queue_mapping(skb);
> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
> +			__netif_tx_lock(txq, smp_processor_id());
> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
> +				__netif_tx_unlock(txq);
> +				ret = nxmit;
> +				goto out;
> +			}
> +			__netif_tx_unlock(txq);

Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
measure the performance overhead of this?

-Toke
Heng Qi Sept. 28, 2022, 11:57 a.m. UTC | #2
On 2022/9/27 8:20 PM, Toke Høiland-Jørgensen wrote:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>> In the current processing logic, when xdp_redirect occurs, it transmits
>> the xdp frame based on napi.
>>
>> If napi of the peer veth is not ready, the veth will drop the packets.
>> This doesn't meet our expectations.
> Erm, why don't you just enable NAPI? Loading an XDP program is not
> needed these days, you can just enable GRO on both peers...

In general, we don't expect veth to drop packets when no XDP program is
attached (or otherwise), because that is not the expected behavior.

>> In this context, if napi is not ready, we convert the xdp frame to a skb,
>> and then use veth_xmit() to deliver it to the peer veth.
>>
>> Like the following case:
>> Even if veth1's napi cannot be used, the packet redirected from the NIC
>> will be transmitted to veth1 successfully:
>>
>> NIC   ->   veth0----veth1
>>   |                   |
>> (XDP)             (no XDP)
>>
>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>> ---
>>   drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 466da01..e1f5561 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>   	/* The napi pointer is set if NAPI is enabled, which ensures that
>>   	 * xdp_ring is initialized on receive side and the peer device is up.
>>   	 */
>> -	if (!rcu_access_pointer(rq->napi))
>> +	if (!rcu_access_pointer(rq->napi)) {
>> +		for (i = 0; i < n; i++) {
>> +			struct xdp_frame *xdpf = frames[i];
>> +			struct netdev_queue *txq = NULL;
>> +			struct sk_buff *skb;
>> +			int queue_mapping;
>> +			u16 mac_len;
>> +
>> +			skb = xdp_build_skb_from_frame(xdpf, dev);
>> +			if (unlikely(!skb)) {
>> +				ret = nxmit;
>> +				goto out;
>> +			}
>> +
>> +			/* We need to restore ETH header, because it is pulled
>> +			 * in eth_type_trans.
>> +			 */
>> +			mac_len = skb->data - skb_mac_header(skb);
>> +			skb_push(skb, mac_len);
>> +
>> +			nxmit++;
>> +
>> +			queue_mapping = skb_get_queue_mapping(skb);
>> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
>> +			__netif_tx_lock(txq, smp_processor_id());
>> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
>> +				__netif_tx_unlock(txq);
>> +				ret = nxmit;
>> +				goto out;
>> +			}
>> +			__netif_tx_unlock(txq);
> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
> measure the performance overhead of this?

Yes, there are indeed some optimizations that can be done here,
like moving the lock outside the loop.
But note that in __dev_queue_xmit(), each packet sent is also protected by a lock.

Thanks.

>
> -Toke
Toke Høiland-Jørgensen Sept. 28, 2022, 2:58 p.m. UTC | #3
Heng Qi <hengqi@linux.alibaba.com> writes:

> On 2022/9/27 8:20 PM, Toke Høiland-Jørgensen wrote:
>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>
>>> In the current processing logic, when xdp_redirect occurs, it transmits
>>> the xdp frame based on napi.
>>>
>>> If napi of the peer veth is not ready, the veth will drop the packets.
>>> This doesn't meet our expectations.
>> Erm, why don't you just enable NAPI? Loading an XDP program is not
>> needed these days, you can just enable GRO on both peers...
>
> In general, we don't expect veth to drop packets when it doesn't mount
> the xdp program or otherwise, because this is not as expected.

Well, did you consider that maybe your expectation is wrong? ;)

>>> In this context, if napi is not ready, we convert the xdp frame to a skb,
>>> and then use veth_xmit() to deliver it to the peer veth.
>>>
>>> Like the following case:
>>> Even if veth1's napi cannot be used, the packet redirected from the NIC
>>> will be transmitted to veth1 successfully:
>>>
>>> NIC   ->   veth0----veth1
>>>   |                   |
>>> (XDP)             (no XDP)
>>>
>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>>   drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 466da01..e1f5561 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>   	/* The napi pointer is set if NAPI is enabled, which ensures that
>>>   	 * xdp_ring is initialized on receive side and the peer device is up.
>>>   	 */
>>> -	if (!rcu_access_pointer(rq->napi))
>>> +	if (!rcu_access_pointer(rq->napi)) {
>>> +		for (i = 0; i < n; i++) {
>>> +			struct xdp_frame *xdpf = frames[i];
>>> +			struct netdev_queue *txq = NULL;
>>> +			struct sk_buff *skb;
>>> +			int queue_mapping;
>>> +			u16 mac_len;
>>> +
>>> +			skb = xdp_build_skb_from_frame(xdpf, dev);
>>> +			if (unlikely(!skb)) {
>>> +				ret = nxmit;
>>> +				goto out;
>>> +			}
>>> +
>>> +			/* We need to restore ETH header, because it is pulled
>>> +			 * in eth_type_trans.
>>> +			 */
>>> +			mac_len = skb->data - skb_mac_header(skb);
>>> +			skb_push(skb, mac_len);
>>> +
>>> +			nxmit++;
>>> +
>>> +			queue_mapping = skb_get_queue_mapping(skb);
>>> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
>>> +			__netif_tx_lock(txq, smp_processor_id());
>>> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
>>> +				__netif_tx_unlock(txq);
>>> +				ret = nxmit;
>>> +				goto out;
>>> +			}
>>> +			__netif_tx_unlock(txq);
>> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
>> measure the performance overhead of this?
>
> Yes, there are indeed some optimizations that can be done here,
> like putting the lock outside the loop.
> But in __dev_queue_xmit(), where each packet sent is also protected by a lock.

...which is another reason why this is a bad idea: it's going to perform
terribly, which means we'll just end up with users wondering why their
XDP performance is terrible and we're going to have to tell them to turn
on GRO anyway. So why not do this from the beginning?

If you want to change the default, flipping GRO to be on by default is a
better solution IMO. I don't actually recall why we didn't do that when
the support was added, but maybe Paolo remembers?

-Toke
Heng Qi Sept. 29, 2022, 2:50 a.m. UTC | #4
On 2022/9/28 10:58 PM, Toke Høiland-Jørgensen wrote:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>> On 2022/9/27 8:20 PM, Toke Høiland-Jørgensen wrote:
>>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>>
>>>> In the current processing logic, when xdp_redirect occurs, it transmits
>>>> the xdp frame based on napi.
>>>>
>>>> If napi of the peer veth is not ready, the veth will drop the packets.
>>>> This doesn't meet our expectations.
>>> Erm, why don't you just enable NAPI? Loading an XDP program is not
>>> needed these days, you can just enable GRO on both peers...
>> In general, we don't expect veth to drop packets when it doesn't mount
>> the xdp program or otherwise, because this is not as expected.
> Well, did you consider that maybe your expectation is wrong? ;)

For users who don't know what other conditions are required for NAPI to be ready,
all they can observe is that packets cannot be sent to the peer veth, which is
exactly the problem we encountered in real-world scenarios.


>
>>>> In this context, if napi is not ready, we convert the xdp frame to a skb,
>>>> and then use veth_xmit() to deliver it to the peer veth.
>>>>
>>>> Like the following case:
>>>> Even if veth1's napi cannot be used, the packet redirected from the NIC
>>>> will be transmitted to veth1 successfully:
>>>>
>>>> NIC   ->   veth0----veth1
>>>>    |                   |
>>>> (XDP)             (no XDP)
>>>>
>>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>> ---
>>>>    drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 35 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>> index 466da01..e1f5561 100644
>>>> --- a/drivers/net/veth.c
>>>> +++ b/drivers/net/veth.c
>>>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>>    	/* The napi pointer is set if NAPI is enabled, which ensures that
>>>>    	 * xdp_ring is initialized on receive side and the peer device is up.
>>>>    	 */
>>>> -	if (!rcu_access_pointer(rq->napi))
>>>> +	if (!rcu_access_pointer(rq->napi)) {
>>>> +		for (i = 0; i < n; i++) {
>>>> +			struct xdp_frame *xdpf = frames[i];
>>>> +			struct netdev_queue *txq = NULL;
>>>> +			struct sk_buff *skb;
>>>> +			int queue_mapping;
>>>> +			u16 mac_len;
>>>> +
>>>> +			skb = xdp_build_skb_from_frame(xdpf, dev);
>>>> +			if (unlikely(!skb)) {
>>>> +				ret = nxmit;
>>>> +				goto out;
>>>> +			}
>>>> +
>>>> +			/* We need to restore ETH header, because it is pulled
>>>> +			 * in eth_type_trans.
>>>> +			 */
>>>> +			mac_len = skb->data - skb_mac_header(skb);
>>>> +			skb_push(skb, mac_len);
>>>> +
>>>> +			nxmit++;
>>>> +
>>>> +			queue_mapping = skb_get_queue_mapping(skb);
>>>> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
>>>> +			__netif_tx_lock(txq, smp_processor_id());
>>>> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
>>>> +				__netif_tx_unlock(txq);
>>>> +				ret = nxmit;
>>>> +				goto out;
>>>> +			}
>>>> +			__netif_tx_unlock(txq);
>>> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
>>> measure the performance overhead of this?
>> Yes, there are indeed some optimizations that can be done here,
>> like putting the lock outside the loop.
>> But in __dev_queue_xmit(), where each packet sent is also protected by a lock.
> ...which is another reason why this is a bad idea: it's going to perform
> terribly, which means we'll just end up with users wondering why their
> XDP performance is terrible and we're going to have to tell them to turn
> on GRO anyway. So why not do this from the beginning?
>
> If you want to change the default, flipping GRO to be on by default is a
> better solution IMO. I don't actually recall why we didn't do that when
> the support was added, but maybe Paolo remembers?

As I said above, in the real case the user's concern is not why XDP
performance becomes bad, but why the packets are not received at all.

The default number of veth queues is not num_possible_cpus(). When GRO is
enabled by default, if there is only one veth queue but multiple CPUs read
and write it at the same time, NAPI efficiency is actually very low because
of producer locks and races. By contrast, with the default veth_xmit() path
each CPU has its own queue, and that way of sending and receiving packets is
also efficient.

Thanks.

>
> -Toke
Paolo Abeni Sept. 29, 2022, 6:57 a.m. UTC | #5
Hello,

On Thu, 2022-09-29 at 10:50 +0800, Heng Qi wrote:
> On 2022/9/28 10:58 PM, Toke Høiland-Jørgensen wrote:
> > Heng Qi <hengqi@linux.alibaba.com> writes:
> > 
> > > On 2022/9/27 8:20 PM, Toke Høiland-Jørgensen wrote:
> > > > Heng Qi <hengqi@linux.alibaba.com> writes:
> > > > 
> > > > > In the current processing logic, when xdp_redirect occurs, it transmits
> > > > > the xdp frame based on napi.
> > > > > 
> > > > > If napi of the peer veth is not ready, the veth will drop the packets.
> > > > > This doesn't meet our expectations.
> > > > Erm, why don't you just enable NAPI? Loading an XDP program is not
> > > > needed these days, you can just enable GRO on both peers...
> > > In general, we don't expect veth to drop packets when it doesn't mount
> > > the xdp program or otherwise, because this is not as expected.
> > Well, did you consider that maybe your expectation is wrong? ;)
> 
> For users who don't know what other conditions are required for the readiness of napi,
> all they can observe is why the packets cannot be sent to the peer veth, which is also
> the problem we encountered in the actual case scenarios.
> 
> 
> > 
> > > > > In this context, if napi is not ready, we convert the xdp frame to a skb,
> > > > > and then use veth_xmit() to deliver it to the peer veth.
> > > > > 
> > > > > Like the following case:
> > > > > Even if veth1's napi cannot be used, the packet redirected from the NIC
> > > > > will be transmitted to veth1 successfully:
> > > > > 
> > > > > NIC   ->   veth0----veth1
> > > > >    |                   |
> > > > > (XDP)             (no XDP)
> > > > > 
> > > > > Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
> > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > ---
> > > > >    drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > >    1 file changed, 35 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > > > > index 466da01..e1f5561 100644
> > > > > --- a/drivers/net/veth.c
> > > > > +++ b/drivers/net/veth.c
> > > > > @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> > > > >    	/* The napi pointer is set if NAPI is enabled, which ensures that
> > > > >    	 * xdp_ring is initialized on receive side and the peer device is up.
> > > > >    	 */
> > > > > -	if (!rcu_access_pointer(rq->napi))
> > > > > +	if (!rcu_access_pointer(rq->napi)) {
> > > > > +		for (i = 0; i < n; i++) {
> > > > > +			struct xdp_frame *xdpf = frames[i];
> > > > > +			struct netdev_queue *txq = NULL;
> > > > > +			struct sk_buff *skb;
> > > > > +			int queue_mapping;
> > > > > +			u16 mac_len;
> > > > > +
> > > > > +			skb = xdp_build_skb_from_frame(xdpf, dev);
> > > > > +			if (unlikely(!skb)) {
> > > > > +				ret = nxmit;
> > > > > +				goto out;
> > > > > +			}
> > > > > +
> > > > > +			/* We need to restore ETH header, because it is pulled
> > > > > +			 * in eth_type_trans.
> > > > > +			 */
> > > > > +			mac_len = skb->data - skb_mac_header(skb);
> > > > > +			skb_push(skb, mac_len);
> > > > > +
> > > > > +			nxmit++;
> > > > > +
> > > > > +			queue_mapping = skb_get_queue_mapping(skb);
> > > > > +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
> > > > > +			__netif_tx_lock(txq, smp_processor_id());
> > > > > +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
> > > > > +				__netif_tx_unlock(txq);
> > > > > +				ret = nxmit;
> > > > > +				goto out;
> > > > > +			}
> > > > > +			__netif_tx_unlock(txq);
> > > > Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
> > > > measure the performance overhead of this?
> > > Yes, there are indeed some optimizations that can be done here,
> > > like putting the lock outside the loop.
> > > But in __dev_queue_xmit(), where each packet sent is also protected by a lock.
> > ...which is another reason why this is a bad idea: it's going to perform
> > terribly, which means we'll just end up with users wondering why their
> > XDP performance is terrible and we're going to have to tell them to turn
> > on GRO anyway. So why not do this from the beginning?
> > 
> > If you want to change the default, flipping GRO to be on by default is a
> > better solution IMO. I don't actually recall why we didn't do that when
> > the support was added, but maybe Paolo remembers?

I preferred to avoid changing the default behavior. Additionally, the
veth GRO stage needs some tuning to really be able to aggregate the
packets (e.g. a NAPI thread or gro_flush_timeout > 0)
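That tuning might look roughly like this (the device name and the timeout
value are examples; threaded and gro_flush_timeout are per-netdev sysfs
attributes):

```shell
# Let veth GRO actually aggregate: defer the NAPI flush a little and/or
# run the NAPI poller in a dedicated kernel thread.
echo 20000 > /sys/class/net/veth0/gro_flush_timeout   # nanoseconds; 0 = off
echo 1     > /sys/class/net/veth0/threaded            # threaded NAPI
```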

> As I said above in the real case, the user's concern is not why the performance
> of xdp becomes bad, but why the data packets are not received.

Well, that arguably tells the end-user there is something wrong in
their setup. On the flip side, having a functionally working setup with
horrible performances would likely lead the users (perhaps not yours,
surely others) in very wrong directions (from "XDP is slow" to "the
problem is in the application")...

> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
> by default, if there is only one veth queue, but multiple CPUs read and write at the
> same time, the efficiency of napi is actually very low due to the existence of
> production locks and races. On the contrary, the default veth_xmit() each cpu has
> its own unique queue, and this way of sending and receiving packets is also efficient.
> 
This patch adds a bit of complexity and it looks completely avoidable
with some configuration - you could enable GRO and set the number of
queues to num_possible_cpus().
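Concretely, that configuration might look like this (names are examples, and
$(nproc) counts online CPUs, which is only an approximation of
num_possible_cpus()):

```shell
# Create the pair with one queue per CPU, then enable GRO so the pair
# runs in NAPI mode:
ip link add veth0 numtxqueues "$(nproc)" numrxqueues "$(nproc)" \
        type veth peer name veth1 numtxqueues "$(nproc)" numrxqueues "$(nproc)"
ethtool -K veth0 gro on
ethtool -K veth1 gro on
```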

I agree with Toke: you should explain to the end-users that their
expectations are wrong, and guide them towards a better setup.

Thanks!

Paolo
Heng Qi Sept. 29, 2022, 7:33 a.m. UTC | #6
On 2022/9/29 2:57 PM, Paolo Abeni wrote:
> Hello,
>
> On Thu, 2022-09-29 at 10:50 +0800, Heng Qi wrote:
>> On 2022/9/28 10:58 PM, Toke Høiland-Jørgensen wrote:
>>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>>
>>>> On 2022/9/27 8:20 PM, Toke Høiland-Jørgensen wrote:
>>>>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>>>>
>>>>>> In the current processing logic, when xdp_redirect occurs, it transmits
>>>>>> the xdp frame based on napi.
>>>>>>
>>>>>> If napi of the peer veth is not ready, the veth will drop the packets.
>>>>>> This doesn't meet our expectations.
>>>>> Erm, why don't you just enable NAPI? Loading an XDP program is not
>>>>> needed these days, you can just enable GRO on both peers...
>>>> In general, we don't expect veth to drop packets when it doesn't mount
>>>> the xdp program or otherwise, because this is not as expected.
>>> Well, did you consider that maybe your expectation is wrong? ;)
>> For users who don't know what other conditions are required for the readiness of napi,
>> all they can observe is why the packets cannot be sent to the peer veth, which is also
>> the problem we encountered in the actual case scenarios.
>>
>>
>>>>>> In this context, if napi is not ready, we convert the xdp frame to a skb,
>>>>>> and then use veth_xmit() to deliver it to the peer veth.
>>>>>>
>>>>>> Like the following case:
>>>>>> Even if veth1's napi cannot be used, the packet redirected from the NIC
>>>>>> will be transmitted to veth1 successfully:
>>>>>>
>>>>>> NIC   ->   veth0----veth1
>>>>>>     |                   |
>>>>>> (XDP)             (no XDP)
>>>>>>
>>>>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
>>>>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>> ---
>>>>>>     drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>>>>>>     1 file changed, 35 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>>>>> index 466da01..e1f5561 100644
>>>>>> --- a/drivers/net/veth.c
>>>>>> +++ b/drivers/net/veth.c
>>>>>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>>>>>>     	/* The napi pointer is set if NAPI is enabled, which ensures that
>>>>>>     	 * xdp_ring is initialized on receive side and the peer device is up.
>>>>>>     	 */
>>>>>> -	if (!rcu_access_pointer(rq->napi))
>>>>>> +	if (!rcu_access_pointer(rq->napi)) {
>>>>>> +		for (i = 0; i < n; i++) {
>>>>>> +			struct xdp_frame *xdpf = frames[i];
>>>>>> +			struct netdev_queue *txq = NULL;
>>>>>> +			struct sk_buff *skb;
>>>>>> +			int queue_mapping;
>>>>>> +			u16 mac_len;
>>>>>> +
>>>>>> +			skb = xdp_build_skb_from_frame(xdpf, dev);
>>>>>> +			if (unlikely(!skb)) {
>>>>>> +				ret = nxmit;
>>>>>> +				goto out;
>>>>>> +			}
>>>>>> +
>>>>>> +			/* We need to restore ETH header, because it is pulled
>>>>>> +			 * in eth_type_trans.
>>>>>> +			 */
>>>>>> +			mac_len = skb->data - skb_mac_header(skb);
>>>>>> +			skb_push(skb, mac_len);
>>>>>> +
>>>>>> +			nxmit++;
>>>>>> +
>>>>>> +			queue_mapping = skb_get_queue_mapping(skb);
>>>>>> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
>>>>>> +			__netif_tx_lock(txq, smp_processor_id());
>>>>>> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
>>>>>> +				__netif_tx_unlock(txq);
>>>>>> +				ret = nxmit;
>>>>>> +				goto out;
>>>>>> +			}
>>>>>> +			__netif_tx_unlock(txq);
>>>>> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
>>>>> measure the performance overhead of this?
>>>> Yes, there are indeed some optimizations that can be done here,
>>>> like putting the lock outside the loop.
>>>> But in __dev_queue_xmit(), where each packet sent is also protected by a lock.
>>> ...which is another reason why this is a bad idea: it's going to perform
>>> terribly, which means we'll just end up with users wondering why their
>>> XDP performance is terrible and we're going to have to tell them to turn
>>> on GRO anyway. So why not do this from the beginning?
>>>
>>> If you want to change the default, flipping GRO to be on by default is a
>>> better solution IMO. I don't actually recall why we didn't do that when
>>> the support was added, but maybe Paolo remembers?
> I preferred to avoid changing the default behavior. Additionally, the
> veth GRO stage needs some tuning to really be able to  aggregate the
> packets (e.g. napi thread or gro_flush_timeout > 0)
>
>> As I said above in the real case, the user's concern is not why the performance
>> of xdp becomes bad, but why the data packets are not received.
> Well, that arguably tells the end-user there is something wrong in
> their setup. On the flip side, having a functionally working setup with
> horrible performances would likely lead the users (perhaps not yours,
> surely others) in very wrong directions (from "XDP is slow" to "the
> problem is in the application")...
>
>> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
>> by default, if there is only one veth queue, but multiple CPUs read and write at the
>> same time, the efficiency of napi is actually very low due to the existence of
>> production locks and races. On the contrary, the default veth_xmit() each cpu has
>> its own unique queue, and this way of sending and receiving packets is also efficient.
>>
> This patch adds a bit of complexity and it looks completely avoidable
> with some configuration - you could enable GRO and set the number of
> queues to num_possible_cpus().
>
> I agree with Toke: you should explain to the end-users that their
> expectations are wrong, and guide them towards a better setup.
>
> Thanks!

Well, one thing I want to know about the following scenario:

NIC   ->   veth0----veth1
  |           |        |
(XDP)      (XDP)    (no XDP)

where xdp_redirect is triggered and both the NIC and veth0 have an XDP
program attached: why is our default behavior to drop the packets that should
be delivered to veth1? Why not enable the NAPI ring of veth1 automatically
when veth0 loads its XDP program, instead of requiring a trivial XDP program
to be configured on veth1?

Thanks.

>
> Paolo
Toke Høiland-Jørgensen Sept. 29, 2022, 12:08 p.m. UTC | #7
Heng Qi <hengqi@linux.alibaba.com> writes:

>>> As I said above in the real case, the user's concern is not why the performance
>>> of xdp becomes bad, but why the data packets are not received.
>> Well, that arguably tells the end-user there is something wrong in
>> their setup. On the flip side, having a functionally working setup with
>> horrible performances would likely lead the users (perhaps not yours,
>> surely others) in very wrong directions (from "XDP is slow" to "the
>> problem is in the application")...
>>
>>> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
>>> by default, if there is only one veth queue, but multiple CPUs read and write at the
>>> same time, the efficiency of napi is actually very low due to the existence of
>>> production locks and races. On the contrary, the default veth_xmit() each cpu has
>>> its own unique queue, and this way of sending and receiving packets is also efficient.
>>>
>> This patch adds a bit of complexity and it looks completely avoidable
>> with some configuration - you could enable GRO and set the number of
>> queues to num_possible_cpus().
>>
>> I agree with Toke: you should explain to the end-users that their
>> expectations are wrong, and guide them towards a better setup.
>>
>> Thanks!
>
> Well, one thing I want to know is that in the following scenario,
>
> NIC   ->   veth0----veth1
>   |           |        |
> (XDP)      (XDP)    (no XDP)
>
> xdp_redirect is triggered,
> and NIC and veth0 are both mounted with the xdp program, then why our default behavior
> is to drop packets that should be sent to veth1 instead of when veth0 is mounted with xdp
> program, the napi ring of veth1 is opened by default at the same time? Why not make it like
> this, but we must configure a simple xdp program on veth1?

As I said in my initial reply, you don't actually need to load an XDP
program (anymore), it's enough to enable GRO through ethtool on both
peers. You can easily do this on setup if you know XDP is going to be
used in your environment.

-Toke
Heng Qi Oct. 20, 2022, 2:23 a.m. UTC | #8
On 2022/9/29 8:08 PM, Toke Høiland-Jørgensen wrote:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>>>> As I said above in the real case, the user's concern is not why the performance
>>>> of xdp becomes bad, but why the data packets are not received.
>>> Well, that arguably tells the end-user there is something wrong in
>>> their setup. On the flip side, having a functionally working setup with
>>> horrible performances would likely lead the users (perhaps not yours,
>>> surely others) in very wrong directions (from "XDP is slow" to "the
>>> problem is in the application")...
>>>
>>>> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
>>>> by default, if there is only one veth queue, but multiple CPUs read and write at the
>>>> same time, the efficiency of napi is actually very low due to the existence of
>>>> production locks and races. On the contrary, the default veth_xmit() each cpu has
>>>> its own unique queue, and this way of sending and receiving packets is also efficient.
>>>>
>>> This patch adds a bit of complexity and it looks completely avoidable
>>> with some configuration - you could enable GRO and set the number of
>>> queues to num_possible_cpus().
>>>
>>> I agree with Toke: you should explain to the end-users that their
>>> expectations are wrong, and guide them towards a better setup.
>>>
>>> Thanks!
>> Well, one thing I want to know is that in the following scenario,
>>
>> NIC   ->   veth0----veth1
>>    |           |        |
>> (XDP)      (XDP)    (no XDP)
>>
>> xdp_redirect is triggered,
>> and NIC and veth0 are both mounted with the xdp program, then why our default behavior
>> is to drop packets that should be sent to veth1 instead of when veth0 is mounted with xdp
>> program, the napi ring of veth1 is opened by default at the same time? Why not make it like
>> this, but we must configure a simple xdp program on veth1?
> As I said in my initial reply, you don't actually need to load an XDP
> program (anymore), it's enough to enable GRO through ethtool on both
> peers. You can easily do this on setup if you know XDP is going to be
> used in your environment.
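
For reference, a minimal setup along these lines (interface names are just
examples) could be:

```shell
# Enable GRO on both veth peers so their NAPI rings are active before any
# xdp_redirect targets them; no dummy XDP program is needed on the peer.
ethtool -K veth0 gro on
ethtool -K veth1 gro on
```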

This does serve our purpose, but in practice users of a veth pair do not
necessarily understand how it works internally. Users unfamiliar with veth
may not know that they need to enable GRO on, or load an XDP program onto,
the peer veth so that redirected packets can be received smoothly. To get
past this confusing problem they may take the time to read the source code,
or even have to find someone else to solve it, while we could avoid all of
that with a simple modification (not necessarily this patch's approach of
using the backlog instead of the napi ring). For example, maybe we should
consider a simpler method: when loading XDP on a veth, we can automatically
enable the napi ring of the peer veth. This seems to have no performance or
functional impact on the veth pair, and no longer requires users to do
anything extra for the peer veth (after all, they may be unaware of such
requirements). Do you think this is feasible?

Thanks.

> -Toke
Toke Høiland-Jørgensen Oct. 20, 2022, 4:34 p.m. UTC | #9
Heng Qi <hengqi@linux.alibaba.com> writes:

> maybe we should consider a simpler method: when loading xdp in veth,
> we can automatically enable the napi ring of peer veth, which seems to
> have no performance impact and functional impact on the veth pair, and
> no longer requires users to do more things for peer veth (after all,
> they may be unaware of more requirements for peer veth). Do you think
> this is feasible?

It could be, perhaps? One issue is what to do once the XDP program is
then unloaded? We should probably disable NAPI on the peer in this case,
but then we'd need to track whether it was enabled by loading an XDP
program; we don't want to disable GRO/NAPI if the user requested it
explicitly. This kind of state tracking gets icky fast, so I guess it'll
depend on the patch...

-Toke
Heng Qi Oct. 21, 2022, 6:31 a.m. UTC | #10
在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>> maybe we should consider a simpler method: when loading xdp in veth,
>> we can automatically enable the napi ring of peer veth, which seems to
>> have no performance impact and functional impact on the veth pair, and
>> no longer requires users to do more things for peer veth (after all,
>> they may be unaware of more requirements for peer veth). Do you think
>> this is feasible?
> It could be, perhaps? One issue is what to do once the XDP program is
> then unloaded? We should probably disable NAPI on the peer in this case,
> but then we'd need to track whether it was enabled by loading an XDP
> program; we don't want to disable GRO/NAPI if the user requested it
> explicitly. This kind of state tracking gets icky fast, so I guess it'll
> depend on the patch...

Regarding tracking whether to disable the peer veth's napi when unloading
the veth's XDP program: this can actually be handled cleanly.

Note that when GRO is enabled on the peer veth, the peer device updates its
`dev->wanted_features` with NETIF_F_GRO (see __netdev_update_features() and
veth_set_features()).

When a veth loads an XDP program and the peer's napi is not ready (that is,
the peer has neither loaded an XDP program nor enabled GRO), we can either
run `ethtool -K veth0 gro on` to enable the peer's napi -- this command also
updates the peer device's wanted_features -- or automatically enable napi
for the peer.

When unloading the XDP program from a veth, we cannot simply disable the
peer's napi. We first need to check whether the peer has GRO requested (see
veth_gro_requested()) or has its own priv->_xdp_prog. If so, we only clean
up this veth's XDP state and disable this veth's napi, leaving the peer's
napi alone, because the peer still needs it for GRO or for its own XDP
program.

But if the peer has neither GRO requested nor an XDP program loaded, we can
disable the peer's napi directly.
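
To make the rules concrete, they can be sketched as a small userspace model
in plain C (this is only an illustration of the proposed state tracking; the
struct and function names below are hypothetical and do not exist in the
kernel sources):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state relevant to the discussion:
 * gro_requested mirrors veth_gro_requested(), has_xdp_prog mirrors
 * priv->_xdp_prog, napi_on is whether the napi ring is active.
 */
struct veth_model {
	bool gro_requested;	/* user enabled GRO (NETIF_F_GRO wanted) */
	bool has_xdp_prog;	/* this device has its own XDP program */
	bool napi_on;		/* napi ring enabled */
};

/* Loading XDP on dev: also bring up the peer's napi if it is not ready,
 * either automatically or via `ethtool -K ... gro on`.
 */
static void model_xdp_load(struct veth_model *dev, struct veth_model *peer)
{
	dev->has_xdp_prog = true;
	dev->napi_on = true;
	if (!peer->napi_on)
		peer->napi_on = true;
}

/* Unloading XDP on dev: keep each side's napi only while it is still
 * needed for an explicit GRO request or the device's own XDP program.
 */
static void model_xdp_unload(struct veth_model *dev, struct veth_model *peer)
{
	dev->has_xdp_prog = false;
	if (!dev->gro_requested)
		dev->napi_on = false;
	if (!peer->gro_requested && !peer->has_xdp_prog)
		peer->napi_on = false;
}
```

The key point the model captures is the last rule: napi on the peer is torn
down on unload only when the peer requested neither GRO nor XDP itself.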

Thanks.

> -Toke
Heng Qi Oct. 24, 2022, 11:20 a.m. UTC | #11
在 2022/10/21 下午2:31, Heng Qi 写道:
>
>
> 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道:
>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>
>>> maybe we should consider a simpler method: when loading xdp in veth,
>>> we can automatically enable the napi ring of peer veth, which seems to
>>> have no performance impact and functional impact on the veth pair, and
>>> no longer requires users to do more things for peer veth (after all,
>>> they may be unaware of more requirements for peer veth). Do you think
>>> this is feasible?
>> It could be, perhaps? One issue is what to do once the XDP program is
>> then unloaded? We should probably disable NAPI on the peer in this case,
>> but then we'd need to track whether it was enabled by loading an XDP
>> program; we don't want to disable GRO/NAPI if the user requested it
>> explicitly. This kind of state tracking gets icky fast, so I guess it'll
>> depend on the patch...
>
> Regarding tracking whether we disable the napi of peer veth when 
> unloading
> the veth's xdp program, this can actually be handled cleanly.
>
> We need to note that when peer veth enable GRO, the peer veth device will
> update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to
> __netdev_update_features() and veth_set_features() ).
>
> When veth loads the xdp program and the napi of peer veth is not ready
> (that is, peer veth does not load the xdp program or has no enable gro),
> at this time, we can choose `ethtool -K veth0 gro on` to enable the 
> napi of
> peer veth, this command also makes the peer veth device update their
> wanted_features, or choose we automatically enable napi for peer veth.
>
> If we want to unload the xdp program for veth, peer veth cannot directly
> disable its napi, because we need to judge whether peer veth is 
> gro_requested
> ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just
> clean veth's xdp environment and disable the napi of veth instead of
> directly disable the napi of peer veth, because of the existence of the
> gro_requested and the xdp program loading on peer veth.
>
> But, if peer veth does not have gro_requested or xdp_program loading 
> on itself,
> then we can directly disable the napi of peer veth.

Hi Toke, do you think the above solution is effective against the problem of
veth xdp_redirect dropping packets? Or is there more to add? If the solution
looks fine for now, I am ready to start with this idea and try to make the
corresponding patch.

Thanks.

>
> Thanks.
>
>> -Toke
>
Toke Høiland-Jørgensen Oct. 24, 2022, 1:34 p.m. UTC | #12
Heng Qi <hengqi@linux.alibaba.com> writes:

> 在 2022/10/21 下午2:31, Heng Qi 写道:
>>
>>
>> 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道:
>>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>>
>>>> maybe we should consider a simpler method: when loading xdp in veth,
>>>> we can automatically enable the napi ring of peer veth, which seems to
>>>> have no performance impact and functional impact on the veth pair, and
>>>> no longer requires users to do more things for peer veth (after all,
>>>> they may be unaware of more requirements for peer veth). Do you think
>>>> this is feasible?
>>> It could be, perhaps? One issue is what to do once the XDP program is
>>> then unloaded? We should probably disable NAPI on the peer in this case,
>>> but then we'd need to track whether it was enabled by loading an XDP
>>> program; we don't want to disable GRO/NAPI if the user requested it
>>> explicitly. This kind of state tracking gets icky fast, so I guess it'll
>>> depend on the patch...
>>
>> Regarding tracking whether we disable the napi of peer veth when 
>> unloading
>> the veth's xdp program, this can actually be handled cleanly.
>>
>> We need to note that when peer veth enable GRO, the peer veth device will
>> update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to
>> __netdev_update_features() and veth_set_features() ).
>>
>> When veth loads the xdp program and the napi of peer veth is not ready
>> (that is, peer veth does not load the xdp program or has no enable gro),
>> at this time, we can choose `ethtool -K veth0 gro on` to enable the 
>> napi of
>> peer veth, this command also makes the peer veth device update their
>> wanted_features, or choose we automatically enable napi for peer veth.
>>
>> If we want to unload the xdp program for veth, peer veth cannot directly
>> disable its napi, because we need to judge whether peer veth is 
>> gro_requested
>> ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just
>> clean veth's xdp environment and disable the napi of veth instead of
>> directly disable the napi of peer veth, because of the existence of the
>> gro_requested and the xdp program loading on peer veth.
>>
>> But, if peer veth does not have gro_requested or xdp_program loading 
>> on itself,
>> then we can directly disable the napi of peer veth.
>
> Hi, Toke. Do you think the above solution is effective for the problem 
> of veth
> xdp_redirect dropping packets? Or is there more to add? If the above solution
> seems to be no problem for the time being, I'm ready to start with this idea
> and try to make the corresponding patch.

I think it might? Please write a patch and we can have a look :)

-Toke
Heng Qi Oct. 24, 2022, 1:39 p.m. UTC | #13
在 2022/10/24 下午9:34, Toke Høiland-Jørgensen 写道:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>> 在 2022/10/21 下午2:31, Heng Qi 写道:
>>>
>>> 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道:
>>>> Heng Qi <hengqi@linux.alibaba.com> writes:
>>>>
>>>>> maybe we should consider a simpler method: when loading xdp in veth,
>>>>> we can automatically enable the napi ring of peer veth, which seems to
>>>>> have no performance impact and functional impact on the veth pair, and
>>>>> no longer requires users to do more things for peer veth (after all,
>>>>> they may be unaware of more requirements for peer veth). Do you think
>>>>> this is feasible?
>>>> It could be, perhaps? One issue is what to do once the XDP program is
>>>> then unloaded? We should probably disable NAPI on the peer in this case,
>>>> but then we'd need to track whether it was enabled by loading an XDP
>>>> program; we don't want to disable GRO/NAPI if the user requested it
>>>> explicitly. This kind of state tracking gets icky fast, so I guess it'll
>>>> depend on the patch...
>>> Regarding tracking whether we disable the napi of peer veth when
>>> unloading
>>> the veth's xdp program, this can actually be handled cleanly.
>>>
>>> We need to note that when peer veth enable GRO, the peer veth device will
>>> update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to
>>> __netdev_update_features() and veth_set_features() ).
>>>
>>> When veth loads the xdp program and the napi of peer veth is not ready
>>> (that is, peer veth does not load the xdp program or has no enable gro),
>>> at this time, we can choose `ethtool -K veth0 gro on` to enable the
>>> napi of
>>> peer veth, this command also makes the peer veth device update their
>>> wanted_features, or choose we automatically enable napi for peer veth.
>>>
>>> If we want to unload the xdp program for veth, peer veth cannot directly
>>> disable its napi, because we need to judge whether peer veth is
>>> gro_requested
>>> ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just
>>> clean veth's xdp environment and disable the napi of veth instead of
>>> directly disable the napi of peer veth, because of the existence of the
>>> gro_requested and the xdp program loading on peer veth.
>>>
>>> But, if peer veth does not have gro_requested or xdp_program loading
>>> on itself,
>>> then we can directly disable the napi of peer veth.
>> Hi, Toke. Do you think the above solution is effective for the problem
>> of veth
>> xdp_redirect dropping packets? Or is there more to add? If the above solution
>> seems to be no problem for the time being, I'm ready to start with this idea
>> and try to make the corresponding patch.
> I think it might? Please write a patch and we can have a look :)

Okay, I will do it right away.

Thanks.

>
> -Toke
diff mbox series

Patch

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 466da01..e1f5561 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -469,8 +469,42 @@  static int veth_xdp_xmit(struct net_device *dev, int n,
 	/* The napi pointer is set if NAPI is enabled, which ensures that
 	 * xdp_ring is initialized on receive side and the peer device is up.
 	 */
-	if (!rcu_access_pointer(rq->napi))
+	if (!rcu_access_pointer(rq->napi)) {
+		for (i = 0; i < n; i++) {
+			struct xdp_frame *xdpf = frames[i];
+			struct netdev_queue *txq = NULL;
+			struct sk_buff *skb;
+			int queue_mapping;
+			u16 mac_len;
+
+			skb = xdp_build_skb_from_frame(xdpf, dev);
+			if (unlikely(!skb)) {
+				ret = nxmit;
+				goto out;
+			}
+
+			/* We need to restore ETH header, because it is pulled
+			 * in eth_type_trans.
+			 */
+			mac_len = skb->data - skb_mac_header(skb);
+			skb_push(skb, mac_len);
+
+			nxmit++;
+
+			queue_mapping = skb_get_queue_mapping(skb);
+			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
+			__netif_tx_lock(txq, smp_processor_id());
+			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
+				__netif_tx_unlock(txq);
+				ret = nxmit;
+				goto out;
+			}
+			__netif_tx_unlock(txq);
+		}
+
+		ret = nxmit;
 		goto out;
+	}
 
 	max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;