Message ID | 1664267413-75518-1-git-send-email-hengqi@linux.alibaba.com (mailing list archive)
---|---
State | Rejected |
Delegated to: | Netdev Maintainers |
Series | [net] veth: Avoid drop packets when xdp_redirect performs
Heng Qi <hengqi@linux.alibaba.com> writes:

> In the current processing logic, when xdp_redirect occurs, it transmits
> the xdp frame based on napi.
>
> If napi of the peer veth is not ready, the veth will drop the packets.
> This doesn't meet our expectations.

Erm, why don't you just enable NAPI? Loading an XDP program is not
needed these days, you can just enable GRO on both peers...

> In this context, if napi is not ready, we convert the xdp frame to a skb,
> and then use veth_xmit() to deliver it to the peer veth.
>
> Like the following case:
> Even if veth1's napi cannot be used, the packet redirected from the NIC
> will be transmitted to veth1 successfully:
>
> NIC -> veth0----veth1
>          |        |
>        (XDP)   (no XDP)
>
> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 466da01..e1f5561 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
>  	/* The napi pointer is set if NAPI is enabled, which ensures that
>  	 * xdp_ring is initialized on receive side and the peer device is up.
>  	 */
> -	if (!rcu_access_pointer(rq->napi))
> +	if (!rcu_access_pointer(rq->napi)) {
> +		for (i = 0; i < n; i++) {
> +			struct xdp_frame *xdpf = frames[i];
> +			struct netdev_queue *txq = NULL;
> +			struct sk_buff *skb;
> +			int queue_mapping;
> +			u16 mac_len;
> +
> +			skb = xdp_build_skb_from_frame(xdpf, dev);
> +			if (unlikely(!skb)) {
> +				ret = nxmit;
> +				goto out;
> +			}
> +
> +			/* We need to restore ETH header, because it is pulled
> +			 * in eth_type_trans.
> +			 */
> +			mac_len = skb->data - skb_mac_header(skb);
> +			skb_push(skb, mac_len);
> +
> +			nxmit++;
> +
> +			queue_mapping = skb_get_queue_mapping(skb);
> +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
> +			__netif_tx_lock(txq, smp_processor_id());
> +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
> +				__netif_tx_unlock(txq);
> +				ret = nxmit;
> +				goto out;
> +			}
> +			__netif_tx_unlock(txq);

Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
measure the performance overhead of this?

-Toke
在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道: > Heng Qi <hengqi@linux.alibaba.com> writes: > >> In the current processing logic, when xdp_redirect occurs, it transmits >> the xdp frame based on napi. >> >> If napi of the peer veth is not ready, the veth will drop the packets. >> This doesn't meet our expectations. > Erm, why don't you just enable NAPI? Loading an XDP program is not > needed these days, you can just enable GRO on both peers... In general, we don't expect veth to drop packets when it doesn't mount the xdp program or otherwise, because this is not as expected. >> In this context, if napi is not ready, we convert the xdp frame to a skb, >> and then use veth_xmit() to deliver it to the peer veth. >> >> Like the following case: >> Even if veth1's napi cannot be used, the packet redirected from the NIC >> will be transmitted to veth1 successfully: >> >> NIC -> veth0----veth1 >> | | >> (XDP) (no XDP) >> >> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> >> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >> --- >> drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++- >> 1 file changed, 35 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/veth.c b/drivers/net/veth.c >> index 466da01..e1f5561 100644 >> --- a/drivers/net/veth.c >> +++ b/drivers/net/veth.c >> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n, >> /* The napi pointer is set if NAPI is enabled, which ensures that >> * xdp_ring is initialized on receive side and the peer device is up. >> */ >> - if (!rcu_access_pointer(rq->napi)) >> + if (!rcu_access_pointer(rq->napi)) { >> + for (i = 0; i < n; i++) { >> + struct xdp_frame *xdpf = frames[i]; >> + struct netdev_queue *txq = NULL; >> + struct sk_buff *skb; >> + int queue_mapping; >> + u16 mac_len; >> + >> + skb = xdp_build_skb_from_frame(xdpf, dev); >> + if (unlikely(!skb)) { >> + ret = nxmit; >> + goto out; >> + } >> + >> + /* We need to restore ETH header, because it is pulled >> + * in eth_type_trans. >> + */ >> + mac_len = skb->data - skb_mac_header(skb); >> + skb_push(skb, mac_len); >> + >> + nxmit++; >> + >> + queue_mapping = skb_get_queue_mapping(skb); >> + txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping)); >> + __netif_tx_lock(txq, smp_processor_id()); >> + if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) { >> + __netif_tx_unlock(txq); >> + ret = nxmit; >> + goto out; >> + } >> + __netif_tx_unlock(txq); > Locking and unlocking the txq repeatedly for each packet? Yikes! Did you > measure the performance overhead of this? Yes, there are indeed some optimizations that can be done here, like putting the lock outside the loop. But in __dev_queue_xmit(), where each packet sent is also protected by a lock. Thanks. > > -Toke
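As an illustration of "putting the lock outside the loop" mentioned above, the fallback path could be reshaped roughly as below. This is an untested sketch, not code posted in this thread: it assumes the drivers/net/veth.c context of the patch (veth_xmit() and the usual kernel headers), and the name veth_xdp_xmit_fallback() is invented. Since xdp_build_skb_from_frame() leaves queue_mapping at zero, all skbs in the batch resolve to the same txq, which is what allows taking the lock once per batch.

/* Untested sketch: batch the fallback transmit under a single txq lock.
 * Assumes it lives in drivers/net/veth.c next to the code in the patch;
 * veth_xdp_xmit_fallback() is a made-up name.
 */
static int veth_xdp_xmit_fallback(struct net_device *dev, int n,
				  struct xdp_frame **frames)
{
	struct netdev_queue *txq;
	int i, nxmit = 0;

	/* Freshly built skbs keep queue_mapping == 0, so every packet in
	 * this batch maps to the same tx queue and the lock can be taken
	 * once instead of per packet.
	 */
	txq = netdev_get_tx_queue(dev, 0);
	__netif_tx_lock(txq, smp_processor_id());

	for (i = 0; i < n; i++) {
		struct sk_buff *skb;

		skb = xdp_build_skb_from_frame(frames[i], dev);
		if (unlikely(!skb))
			break;

		/* Restore the Ethernet header pulled by eth_type_trans(). */
		skb_push(skb, skb->data - skb_mac_header(skb));

		if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK))
			break;
		nxmit++;
	}

	__netif_tx_unlock(txq);
	return nxmit;
}

Whether such batching would actually close the performance gap raised above would still need to be measured.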
Heng Qi <hengqi@linux.alibaba.com> writes: > 在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道: >> Heng Qi <hengqi@linux.alibaba.com> writes: >> >>> In the current processing logic, when xdp_redirect occurs, it transmits >>> the xdp frame based on napi. >>> >>> If napi of the peer veth is not ready, the veth will drop the packets. >>> This doesn't meet our expectations. >> Erm, why don't you just enable NAPI? Loading an XDP program is not >> needed these days, you can just enable GRO on both peers... > > In general, we don't expect veth to drop packets when it doesn't mount > the xdp program or otherwise, because this is not as expected. Well, did you consider that maybe your expectation is wrong? ;) >>> In this context, if napi is not ready, we convert the xdp frame to a skb, >>> and then use veth_xmit() to deliver it to the peer veth. >>> >>> Like the following case: >>> Even if veth1's napi cannot be used, the packet redirected from the NIC >>> will be transmitted to veth1 successfully: >>> >>> NIC -> veth0----veth1 >>> | | >>> (XDP) (no XDP) >>> >>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> >>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>> --- >>> drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++- >>> 1 file changed, 35 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c >>> index 466da01..e1f5561 100644 >>> --- a/drivers/net/veth.c >>> +++ b/drivers/net/veth.c >>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n, >>> /* The napi pointer is set if NAPI is enabled, which ensures that >>> * xdp_ring is initialized on receive side and the peer device is up. >>> */ >>> - if (!rcu_access_pointer(rq->napi)) >>> + if (!rcu_access_pointer(rq->napi)) { >>> + for (i = 0; i < n; i++) { >>> + struct xdp_frame *xdpf = frames[i]; >>> + struct netdev_queue *txq = NULL; >>> + struct sk_buff *skb; >>> + int queue_mapping; >>> + u16 mac_len; >>> + >>> + skb = xdp_build_skb_from_frame(xdpf, dev); >>> + if (unlikely(!skb)) { >>> + ret = nxmit; >>> + goto out; >>> + } >>> + >>> + /* We need to restore ETH header, because it is pulled >>> + * in eth_type_trans. >>> + */ >>> + mac_len = skb->data - skb_mac_header(skb); >>> + skb_push(skb, mac_len); >>> + >>> + nxmit++; >>> + >>> + queue_mapping = skb_get_queue_mapping(skb); >>> + txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping)); >>> + __netif_tx_lock(txq, smp_processor_id()); >>> + if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) { >>> + __netif_tx_unlock(txq); >>> + ret = nxmit; >>> + goto out; >>> + } >>> + __netif_tx_unlock(txq); >> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you >> measure the performance overhead of this? > > Yes, there are indeed some optimizations that can be done here, > like putting the lock outside the loop. > But in __dev_queue_xmit(), where each packet sent is also protected by a lock. ...which is another reason why this is a bad idea: it's going to perform terribly, which means we'll just end up with users wondering why their XDP performance is terrible and we're going to have to tell them to turn on GRO anyway. So why not do this from the beginning? If you want to change the default, flipping GRO to be on by default is a better solution IMO. I don't actually recall why we didn't do that when the support was added, but maybe Paolo remembers? -Toke
在 2022/9/28 下午10:58, Toke Høiland-Jørgensen 写道: > Heng Qi <hengqi@linux.alibaba.com> writes: > >> 在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道: >>> Heng Qi <hengqi@linux.alibaba.com> writes: >>> >>>> In the current processing logic, when xdp_redirect occurs, it transmits >>>> the xdp frame based on napi. >>>> >>>> If napi of the peer veth is not ready, the veth will drop the packets. >>>> This doesn't meet our expectations. >>> Erm, why don't you just enable NAPI? Loading an XDP program is not >>> needed these days, you can just enable GRO on both peers... >> In general, we don't expect veth to drop packets when it doesn't mount >> the xdp program or otherwise, because this is not as expected. > Well, did you consider that maybe your expectation is wrong? ;) For users who don't know what other conditions are required for the readiness of napi, all they can observe is why the packets cannot be sent to the peer veth, which is also the problem we encountered in the actual case scenarios. > >>>> In this context, if napi is not ready, we convert the xdp frame to a skb, >>>> and then use veth_xmit() to deliver it to the peer veth. >>>> >>>> Like the following case: >>>> Even if veth1's napi cannot be used, the packet redirected from the NIC >>>> will be transmitted to veth1 successfully: >>>> >>>> NIC -> veth0----veth1 >>>> | | >>>> (XDP) (no XDP) >>>> >>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> >>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>>> --- >>>> drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++- >>>> 1 file changed, 35 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c >>>> index 466da01..e1f5561 100644 >>>> --- a/drivers/net/veth.c >>>> +++ b/drivers/net/veth.c >>>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n, >>>> /* The napi pointer is set if NAPI is enabled, which ensures that >>>> * xdp_ring is initialized on receive side and the peer device is up. >>>> */ >>>> - if (!rcu_access_pointer(rq->napi)) >>>> + if (!rcu_access_pointer(rq->napi)) { >>>> + for (i = 0; i < n; i++) { >>>> + struct xdp_frame *xdpf = frames[i]; >>>> + struct netdev_queue *txq = NULL; >>>> + struct sk_buff *skb; >>>> + int queue_mapping; >>>> + u16 mac_len; >>>> + >>>> + skb = xdp_build_skb_from_frame(xdpf, dev); >>>> + if (unlikely(!skb)) { >>>> + ret = nxmit; >>>> + goto out; >>>> + } >>>> + >>>> + /* We need to restore ETH header, because it is pulled >>>> + * in eth_type_trans. >>>> + */ >>>> + mac_len = skb->data - skb_mac_header(skb); >>>> + skb_push(skb, mac_len); >>>> + >>>> + nxmit++; >>>> + >>>> + queue_mapping = skb_get_queue_mapping(skb); >>>> + txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping)); >>>> + __netif_tx_lock(txq, smp_processor_id()); >>>> + if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) { >>>> + __netif_tx_unlock(txq); >>>> + ret = nxmit; >>>> + goto out; >>>> + } >>>> + __netif_tx_unlock(txq); >>> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you >>> measure the performance overhead of this? >> Yes, there are indeed some optimizations that can be done here, >> like putting the lock outside the loop. >> But in __dev_queue_xmit(), where each packet sent is also protected by a lock. > ...which is another reason why this is a bad idea: it's going to perform > terribly, which means we'll just end up with users wondering why their > XDP performance is terrible and we're going to have to tell them to turn > on GRO anyway. 
So why not do this from the beginning? > > If you want to change the default, flipping GRO to be on by default is a > better solution IMO. I don't actually recall why we didn't do that when > the support was added, but maybe Paolo remembers? As I said above in the real case, the user's concern is not why the performance of xdp becomes bad, but why the data packets are not received. The default number of veth queues is not num_possible_cpus(). When GRO is enabled by default, if there is only one veth queue, but multiple CPUs read and write at the same time, the efficiency of napi is actually very low due to the existence of production locks and races. On the contrary, the default veth_xmit() each cpu has its own unique queue, and this way of sending and receiving packets is also efficient. Thanks. > > -Toke
Hello, On Thu, 2022-09-29 at 10:50 +0800, Heng Qi wrote: > 在 2022/9/28 下午10:58, Toke Høiland-Jørgensen 写道: > > Heng Qi <hengqi@linux.alibaba.com> writes: > > > > > 在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道: > > > > Heng Qi <hengqi@linux.alibaba.com> writes: > > > > > > > > > In the current processing logic, when xdp_redirect occurs, it transmits > > > > > the xdp frame based on napi. > > > > > > > > > > If napi of the peer veth is not ready, the veth will drop the packets. > > > > > This doesn't meet our expectations. > > > > Erm, why don't you just enable NAPI? Loading an XDP program is not > > > > needed these days, you can just enable GRO on both peers... > > > In general, we don't expect veth to drop packets when it doesn't mount > > > the xdp program or otherwise, because this is not as expected. > > Well, did you consider that maybe your expectation is wrong? ;) > > For users who don't know what other conditions are required for the readiness of napi, > all they can observe is why the packets cannot be sent to the peer veth, which is also > the problem we encountered in the actual case scenarios. > > > > > > > > > In this context, if napi is not ready, we convert the xdp frame to a skb, > > > > > and then use veth_xmit() to deliver it to the peer veth. > > > > > > > > > > Like the following case: > > > > > Even if veth1's napi cannot be used, the packet redirected from the NIC > > > > > will be transmitted to veth1 successfully: > > > > > > > > > > NIC -> veth0----veth1 > > > > > | | > > > > > (XDP) (no XDP) > > > > > > > > > > Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > --- > > > > > drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++- > > > > > 1 file changed, 35 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/net/veth.c b/drivers/net/veth.c > > > > > index 466da01..e1f5561 100644 > > > > > --- a/drivers/net/veth.c > > > > > +++ b/drivers/net/veth.c > > > > > @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n, > > > > > /* The napi pointer is set if NAPI is enabled, which ensures that > > > > > * xdp_ring is initialized on receive side and the peer device is up. > > > > > */ > > > > > - if (!rcu_access_pointer(rq->napi)) > > > > > + if (!rcu_access_pointer(rq->napi)) { > > > > > + for (i = 0; i < n; i++) { > > > > > + struct xdp_frame *xdpf = frames[i]; > > > > > + struct netdev_queue *txq = NULL; > > > > > + struct sk_buff *skb; > > > > > + int queue_mapping; > > > > > + u16 mac_len; > > > > > + > > > > > + skb = xdp_build_skb_from_frame(xdpf, dev); > > > > > + if (unlikely(!skb)) { > > > > > + ret = nxmit; > > > > > + goto out; > > > > > + } > > > > > + > > > > > + /* We need to restore ETH header, because it is pulled > > > > > + * in eth_type_trans. > > > > > + */ > > > > > + mac_len = skb->data - skb_mac_header(skb); > > > > > + skb_push(skb, mac_len); > > > > > + > > > > > + nxmit++; > > > > > + > > > > > + queue_mapping = skb_get_queue_mapping(skb); > > > > > + txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping)); > > > > > + __netif_tx_lock(txq, smp_processor_id()); > > > > > + if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) { > > > > > + __netif_tx_unlock(txq); > > > > > + ret = nxmit; > > > > > + goto out; > > > > > + } > > > > > + __netif_tx_unlock(txq); > > > > Locking and unlocking the txq repeatedly for each packet? Yikes! Did you > > > > measure the performance overhead of this? 
> > > Yes, there are indeed some optimizations that can be done here, > > > like putting the lock outside the loop. > > > But in __dev_queue_xmit(), where each packet sent is also protected by a lock. > > ...which is another reason why this is a bad idea: it's going to perform > > terribly, which means we'll just end up with users wondering why their > > XDP performance is terrible and we're going to have to tell them to turn > > on GRO anyway. So why not do this from the beginning? > > > > If you want to change the default, flipping GRO to be on by default is a > > better solution IMO. I don't actually recall why we didn't do that when > > the support was added, but maybe Paolo remembers? I preferred to avoid changing the default behavior. Additionally, the veth GRO stage needs some tuning to really be able to aggregate the packets (e.g. napi thread or gro_flush_timeout > 0) > As I said above in the real case, the user's concern is not why the performance > of xdp becomes bad, but why the data packets are not received. Well, that arguably tells the end-user there is something wrong in their setup. On the flip side, having a functionally working setup with horrible performances would likely lead the users (perhaps not yours, surely others) in very wrong directions (from "XDP is slow" to "the problem is in the application")... > The default number of veth queues is not num_possible_cpus(). When GRO is enabled > by default, if there is only one veth queue, but multiple CPUs read and write at the > same time, the efficiency of napi is actually very low due to the existence of > production locks and races. On the contrary, the default veth_xmit() each cpu has > its own unique queue, and this way of sending and receiving packets is also efficient. > This patch adds a bit of complexity and it looks completely avoidable with some configuration - you could enable GRO and set the number of queues to num_possible_cpus(). I agree with Toke, you should explain the end-users that their expecations are wrong, and guide them towards a better setup. Thanks! Paolo
在 2022/9/29 下午2:57, Paolo Abeni 写道: > Hello, > > On Thu, 2022-09-29 at 10:50 +0800, Heng Qi wrote: >> 在 2022/9/28 下午10:58, Toke Høiland-Jørgensen 写道: >>> Heng Qi <hengqi@linux.alibaba.com> writes: >>> >>>> 在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道: >>>>> Heng Qi <hengqi@linux.alibaba.com> writes: >>>>> >>>>>> In the current processing logic, when xdp_redirect occurs, it transmits >>>>>> the xdp frame based on napi. >>>>>> >>>>>> If napi of the peer veth is not ready, the veth will drop the packets. >>>>>> This doesn't meet our expectations. >>>>> Erm, why don't you just enable NAPI? Loading an XDP program is not >>>>> needed these days, you can just enable GRO on both peers... >>>> In general, we don't expect veth to drop packets when it doesn't mount >>>> the xdp program or otherwise, because this is not as expected. >>> Well, did you consider that maybe your expectation is wrong? ;) >> For users who don't know what other conditions are required for the readiness of napi, >> all they can observe is why the packets cannot be sent to the peer veth, which is also >> the problem we encountered in the actual case scenarios. >> >> >>>>>> In this context, if napi is not ready, we convert the xdp frame to a skb, >>>>>> and then use veth_xmit() to deliver it to the peer veth. >>>>>> >>>>>> Like the following case: >>>>>> Even if veth1's napi cannot be used, the packet redirected from the NIC >>>>>> will be transmitted to veth1 successfully: >>>>>> >>>>>> NIC -> veth0----veth1 >>>>>> | | >>>>>> (XDP) (no XDP) >>>>>> >>>>>> Signed-off-by: Heng Qi <hengqi@linux.alibaba.com> >>>>>> Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>>>>> --- >>>>>> drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++- >>>>>> 1 file changed, 35 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c >>>>>> index 466da01..e1f5561 100644 >>>>>> --- a/drivers/net/veth.c >>>>>> +++ b/drivers/net/veth.c >>>>>> @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n, >>>>>> /* The napi pointer is set if NAPI is enabled, which ensures that >>>>>> * xdp_ring is initialized on receive side and the peer device is up. >>>>>> */ >>>>>> - if (!rcu_access_pointer(rq->napi)) >>>>>> + if (!rcu_access_pointer(rq->napi)) { >>>>>> + for (i = 0; i < n; i++) { >>>>>> + struct xdp_frame *xdpf = frames[i]; >>>>>> + struct netdev_queue *txq = NULL; >>>>>> + struct sk_buff *skb; >>>>>> + int queue_mapping; >>>>>> + u16 mac_len; >>>>>> + >>>>>> + skb = xdp_build_skb_from_frame(xdpf, dev); >>>>>> + if (unlikely(!skb)) { >>>>>> + ret = nxmit; >>>>>> + goto out; >>>>>> + } >>>>>> + >>>>>> + /* We need to restore ETH header, because it is pulled >>>>>> + * in eth_type_trans. >>>>>> + */ >>>>>> + mac_len = skb->data - skb_mac_header(skb); >>>>>> + skb_push(skb, mac_len); >>>>>> + >>>>>> + nxmit++; >>>>>> + >>>>>> + queue_mapping = skb_get_queue_mapping(skb); >>>>>> + txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping)); >>>>>> + __netif_tx_lock(txq, smp_processor_id()); >>>>>> + if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) { >>>>>> + __netif_tx_unlock(txq); >>>>>> + ret = nxmit; >>>>>> + goto out; >>>>>> + } >>>>>> + __netif_tx_unlock(txq); >>>>> Locking and unlocking the txq repeatedly for each packet? Yikes! Did you >>>>> measure the performance overhead of this? >>>> Yes, there are indeed some optimizations that can be done here, >>>> like putting the lock outside the loop. 
>>>> But in __dev_queue_xmit(), where each packet sent is also protected by a lock. >>> ...which is another reason why this is a bad idea: it's going to perform >>> terribly, which means we'll just end up with users wondering why their >>> XDP performance is terrible and we're going to have to tell them to turn >>> on GRO anyway. So why not do this from the beginning? >>> >>> If you want to change the default, flipping GRO to be on by default is a >>> better solution IMO. I don't actually recall why we didn't do that when >>> the support was added, but maybe Paolo remembers? > I preferred to avoid changing the default behavior. Additionally, the > veth GRO stage needs some tuning to really be able to aggregate the > packets (e.g. napi thread or gro_flush_timeout > 0) > >> As I said above in the real case, the user's concern is not why the performance >> of xdp becomes bad, but why the data packets are not received. > Well, that arguably tells the end-user there is something wrong in > their setup. On the flip side, having a functionally working setup with > horrible performances would likely lead the users (perhaps not yours, > surely others) in very wrong directions (from "XDP is slow" to "the > problem is in the application")... > >> The default number of veth queues is not num_possible_cpus(). When GRO is enabled >> by default, if there is only one veth queue, but multiple CPUs read and write at the >> same time, the efficiency of napi is actually very low due to the existence of >> production locks and races. On the contrary, the default veth_xmit() each cpu has >> its own unique queue, and this way of sending and receiving packets is also efficient. >> > This patch adds a bit of complexity and it looks completely avoidable > with some configuration - you could enable GRO and set the number of > queues to num_possible_cpus(). > > I agree with Toke, you should explain the end-users that their > expecations are wrong, and guide them towards a better setup. > > Thanks! Well, one thing I want to know is that in the following scenario, NIC -> veth0----veth1 | | | (XDP) (XDP) (no XDP) xdp_redirect is triggered, and NIC and veth0 are both mounted with the xdp program, then why our default behavior is to drop packets that should be sent to veth1 instead of when veth0 is mounted with xdp program, the napi ring of veth1 is opened by default at the same time? Why not make it like this, but we must configure a simple xdp program on veth1? Thanks. > > Paolo
Heng Qi <hengqi@linux.alibaba.com> writes:

>>> As I said above in the real case, the user's concern is not why the performance
>>> of xdp becomes bad, but why the data packets are not received.
>> Well, that arguably tells the end-user there is something wrong in
>> their setup. On the flip side, having a functionally working setup with
>> horrible performances would likely lead the users (perhaps not yours,
>> surely others) in very wrong directions (from "XDP is slow" to "the
>> problem is in the application")...
>>
>>> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
>>> by default, if there is only one veth queue, but multiple CPUs read and write at the
>>> same time, the efficiency of napi is actually very low due to the existence of
>>> production locks and races. On the contrary, the default veth_xmit() each cpu has
>>> its own unique queue, and this way of sending and receiving packets is also efficient.
>>>
>> This patch adds a bit of complexity and it looks completely avoidable
>> with some configuration - you could enable GRO and set the number of
>> queues to num_possible_cpus().
>>
>> I agree with Toke, you should explain the end-users that their
>> expecations are wrong, and guide them towards a better setup.
>>
>> Thanks!
>
> Well, one thing I want to know is that in the following scenario,
>
> NIC -> veth0----veth1
>  |       |        |
> (XDP)  (XDP)  (no XDP)
>
> xdp_redirect is triggered,
> and NIC and veth0 are both mounted with the xdp program, then why our default behavior
> is to drop packets that should be sent to veth1 instead of when veth0 is mounted with xdp
> program, the napi ring of veth1 is opened by default at the same time? Why not make it like
> this, but we must configure a simple xdp program on veth1?

As I said in my initial reply, you don't actually need to load an XDP
program (anymore), it's enough to enable GRO through ethtool on both
peers. You can easily do this on setup if you know XDP is going to be
used in your environment.

-Toke
On 2022/9/29 8:08 PM, Toke Høiland-Jørgensen wrote:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>>>> As I said above in the real case, the user's concern is not why the performance
>>>> of xdp becomes bad, but why the data packets are not received.
>>> Well, that arguably tells the end-user there is something wrong in
>>> their setup. On the flip side, having a functionally working setup with
>>> horrible performances would likely lead the users (perhaps not yours,
>>> surely others) in very wrong directions (from "XDP is slow" to "the
>>> problem is in the application")...
>>>
>>>> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
>>>> by default, if there is only one veth queue, but multiple CPUs read and write at the
>>>> same time, the efficiency of napi is actually very low due to the existence of
>>>> production locks and races. On the contrary, the default veth_xmit() each cpu has
>>>> its own unique queue, and this way of sending and receiving packets is also efficient.
>>>>
>>> This patch adds a bit of complexity and it looks completely avoidable
>>> with some configuration - you could enable GRO and set the number of
>>> queues to num_possible_cpus().
>>>
>>> I agree with Toke, you should explain the end-users that their
>>> expecations are wrong, and guide them towards a better setup.
>>>
>>> Thanks!
>> Well, one thing I want to know is that in the following scenario,
>>
>> NIC -> veth0----veth1
>>  |       |        |
>> (XDP)  (XDP)  (no XDP)
>>
>> xdp_redirect is triggered,
>> and NIC and veth0 are both mounted with the xdp program, then why our default behavior
>> is to drop packets that should be sent to veth1 instead of when veth0 is mounted with xdp
>> program, the napi ring of veth1 is opened by default at the same time? Why not make it like
>> this, but we must configure a simple xdp program on veth1?
> As I said in my initial reply, you don't actually need to load an XDP
> program (anymore), it's enough to enable GRO through ethtool on both
> peers. You can easily do this on setup if you know XDP is going to be
> used in your environment.

This does serve our purpose, but in fact users of a veth pair do not
necessarily understand how it works. Users who are not familiar with veth
may not know that they need to enable GRO or load an xdp program on the
peer veth to ensure that redirected packets are received smoothly. To
figure that out they may have to take the time to read the source code,
or even find someone else to solve it for them. We can avoid this with a
simple modification (not necessarily the one made by this patch, which
falls back to the backlog instead of the napi ring); for example,
maybe we should consider a simpler method: when loading xdp in veth,
we can automatically enable the napi ring of peer veth, which seems to
have no performance impact and functional impact on the veth pair, and
no longer requires users to do more things for peer veth (after all,
they may be unaware of more requirements for peer veth). Do you think
this is feasible?

Thanks.

> -Toke
Heng Qi <hengqi@linux.alibaba.com> writes:

> maybe we should consider a simpler method: when loading xdp in veth,
> we can automatically enable the napi ring of peer veth, which seems to
> have no performance impact and functional impact on the veth pair, and
> no longer requires users to do more things for peer veth (after all,
> they may be unaware of more requirements for peer veth). Do you think
> this is feasible?

It could be, perhaps? One issue is what to do once the XDP program is
then unloaded? We should probably disable NAPI on the peer in this case,
but then we'd need to track whether it was enabled by loading an XDP
program; we don't want to disable GRO/NAPI if the user requested it
explicitly. This kind of state tracking gets icky fast, so I guess it'll
depend on the patch...

-Toke
On 2022/10/21 12:34 AM, Toke Høiland-Jørgensen wrote:
> Heng Qi <hengqi@linux.alibaba.com> writes:
>
>> maybe we should consider a simpler method: when loading xdp in veth,
>> we can automatically enable the napi ring of peer veth, which seems to
>> have no performance impact and functional impact on the veth pair, and
>> no longer requires users to do more things for peer veth (after all,
>> they may be unaware of more requirements for peer veth). Do you think
>> this is feasible?
> It could be, perhaps? One issue is what to do once the XDP program is
> then unloaded? We should probably disable NAPI on the peer in this case,
> but then we'd need to track whether it was enabled by loading an XDP
> program; we don't want to disable GRO/NAPI if the user requested it
> explicitly. This kind of state tracking gets icky fast, so I guess it'll
> depend on the patch...

Regarding tracking whether we disable the napi of peer veth when unloading
the veth's xdp program, this can actually be handled cleanly.

We need to note that when peer veth enable GRO, the peer veth device will
update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to
__netdev_update_features() and veth_set_features() ).

When veth loads the xdp program and the napi of peer veth is not ready
(that is, peer veth does not load the xdp program or has no enable gro),
at this time, we can choose `ethtool -K veth0 gro on` to enable the napi
of peer veth, this command also makes the peer veth device update their
wanted_features, or choose we automatically enable napi for peer veth.

If we want to unload the xdp program for veth, peer veth cannot directly
disable its napi, because we need to judge whether peer veth is
gro_requested ( ref to veth_gro_requested() ) or has its priv->_xdp_prog,
if so, just clean veth's xdp environment and disable the napi of veth
instead of directly disable the napi of peer veth, because of the
existence of the gro_requested and the xdp program loading on peer veth.

But, if peer veth does not have gro_requested or xdp_program loading
on itself, then we can directly disable the napi of peer veth.

Thanks.

> -Toke
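A rough sketch of the attach/detach rule described above, to make it concrete. It is untested and not part of any posted patch: it assumes the drivers/net/veth.c context (struct veth_priv with its peer pointer and _xdp_prog field, veth_gro_requested() as referenced in the mail, and rtnl held as in the real ndo_bpf path), while veth_peer_napi_enable()/veth_peer_napi_disable() and the peer_napi_by_xdp flag are hypothetical stand-ins rather than existing driver API.

/* Untested sketch of the proposed tracking; anything marked
 * "hypothetical" does not exist in veth.c today.
 */
static int veth_xdp_attach(struct net_device *dev, struct bpf_prog *prog)
{
	struct veth_priv *priv = netdev_priv(dev);
	struct net_device *peer = rtnl_dereference(priv->peer);
	struct veth_priv *peer_priv = netdev_priv(peer);

	/* Bring the peer's NAPI up only if nothing else already did,
	 * and remember that it was done on behalf of this XDP program.
	 */
	if (!peer_priv->_xdp_prog && !veth_gro_requested(peer)) {
		int err = veth_peer_napi_enable(peer);	/* hypothetical */

		if (err)
			return err;
		priv->peer_napi_by_xdp = true;		/* hypothetical field */
	}

	priv->_xdp_prog = prog;
	return 0;
}

static void veth_xdp_detach(struct net_device *dev)
{
	struct veth_priv *priv = netdev_priv(dev);
	struct net_device *peer = rtnl_dereference(priv->peer);
	struct veth_priv *peer_priv = netdev_priv(peer);

	priv->_xdp_prog = NULL;

	/* Undo only what attach did: if the peer meanwhile requested GRO
	 * or loaded its own XDP program, its NAPI must stay up.
	 */
	if (priv->peer_napi_by_xdp && !peer_priv->_xdp_prog &&
	    !veth_gro_requested(peer))
		veth_peer_napi_disable(peer);		/* hypothetical */

	priv->peer_napi_by_xdp = false;
}

The peer_napi_by_xdp flag is what would address the state-tracking concern raised earlier: detach tears down only what attach itself enabled, and an explicit GRO request on the peer is never undone.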
在 2022/10/21 下午2:31, Heng Qi 写道: > > > 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道: >> Heng Qi <hengqi@linux.alibaba.com> writes: >> >>> maybe we should consider a simpler method: when loading xdp in veth, >>> we can automatically enable the napi ring of peer veth, which seems to >>> have no performance impact and functional impact on the veth pair, and >>> no longer requires users to do more things for peer veth (after all, >>> they may be unaware of more requirements for peer veth). Do you think >>> this is feasible? >> It could be, perhaps? One issue is what to do once the XDP program is >> then unloaded? We should probably disable NAPI on the peer in this case, >> but then we'd need to track whether it was enabled by loading an XDP >> program; we don't want to disable GRO/NAPI if the user requested it >> explicitly. This kind of state tracking gets icky fast, so I guess it'll >> depend on the patch... > > Regarding tracking whether we disable the napi of peer veth when > unloading > the veth's xdp program, this can actually be handled cleanly. > > We need to note that when peer veth enable GRO, the peer veth device will > update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to > __netdev_update_features() and veth_set_features() ). > > When veth loads the xdp program and the napi of peer veth is not ready > (that is, peer veth does not load the xdp program or has no enable gro), > at this time, we can choose `ethtool -K veth0 gro on` to enable the > napi of > peer veth, this command also makes the peer veth device update their > wanted_features, or choose we automatically enable napi for peer veth. > > If we want to unload the xdp program for veth, peer veth cannot directly > disable its napi, because we need to judge whether peer veth is > gro_requested > ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just > clean veth's xdp environment and disable the napi of veth instead of > directly disable the napi of peer veth, because of the existence of the > gro_requested and the xdp program loading on peer veth. > > But, if peer veth does not have gro_requested or xdp_program loading > on itself, > then we can directly disable the napi of peer veth. Hi, Toke. Do you think the above solution is effective for the problem of veth xdp_rediect dropping packets? Or is there more to add? If the above solution seems to be no problem for the time being, I'm ready to start with this idea and try to make the corresponding patch. Thanks. > > Thanks. > >> -Toke >
Heng Qi <hengqi@linux.alibaba.com> writes: > 在 2022/10/21 下午2:31, Heng Qi 写道: >> >> >> 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道: >>> Heng Qi <hengqi@linux.alibaba.com> writes: >>> >>>> maybe we should consider a simpler method: when loading xdp in veth, >>>> we can automatically enable the napi ring of peer veth, which seems to >>>> have no performance impact and functional impact on the veth pair, and >>>> no longer requires users to do more things for peer veth (after all, >>>> they may be unaware of more requirements for peer veth). Do you think >>>> this is feasible? >>> It could be, perhaps? One issue is what to do once the XDP program is >>> then unloaded? We should probably disable NAPI on the peer in this case, >>> but then we'd need to track whether it was enabled by loading an XDP >>> program; we don't want to disable GRO/NAPI if the user requested it >>> explicitly. This kind of state tracking gets icky fast, so I guess it'll >>> depend on the patch... >> >> Regarding tracking whether we disable the napi of peer veth when >> unloading >> the veth's xdp program, this can actually be handled cleanly. >> >> We need to note that when peer veth enable GRO, the peer veth device will >> update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to >> __netdev_update_features() and veth_set_features() ). >> >> When veth loads the xdp program and the napi of peer veth is not ready >> (that is, peer veth does not load the xdp program or has no enable gro), >> at this time, we can choose `ethtool -K veth0 gro on` to enable the >> napi of >> peer veth, this command also makes the peer veth device update their >> wanted_features, or choose we automatically enable napi for peer veth. >> >> If we want to unload the xdp program for veth, peer veth cannot directly >> disable its napi, because we need to judge whether peer veth is >> gro_requested >> ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just >> clean veth's xdp environment and disable the napi of veth instead of >> directly disable the napi of peer veth, because of the existence of the >> gro_requested and the xdp program loading on peer veth. >> >> But, if peer veth does not have gro_requested or xdp_program loading >> on itself, >> then we can directly disable the napi of peer veth. > > Hi, Toke. Do you think the above solution is effective for the problem > of veth > xdp_rediect dropping packets? Or is there more to add? If the above solution > seems to be no problem for the time being, I'm ready to start with this idea > and try to make the corresponding patch. I think it might? Please write a patch and we can have a look :) -Toke
在 2022/10/24 下午9:34, Toke Høiland-Jørgensen 写道: > Heng Qi <hengqi@linux.alibaba.com> writes: > >> 在 2022/10/21 下午2:31, Heng Qi 写道: >>> >>> 在 2022/10/21 上午12:34, Toke Høiland-Jørgensen 写道: >>>> Heng Qi <hengqi@linux.alibaba.com> writes: >>>> >>>>> maybe we should consider a simpler method: when loading xdp in veth, >>>>> we can automatically enable the napi ring of peer veth, which seems to >>>>> have no performance impact and functional impact on the veth pair, and >>>>> no longer requires users to do more things for peer veth (after all, >>>>> they may be unaware of more requirements for peer veth). Do you think >>>>> this is feasible? >>>> It could be, perhaps? One issue is what to do once the XDP program is >>>> then unloaded? We should probably disable NAPI on the peer in this case, >>>> but then we'd need to track whether it was enabled by loading an XDP >>>> program; we don't want to disable GRO/NAPI if the user requested it >>>> explicitly. This kind of state tracking gets icky fast, so I guess it'll >>>> depend on the patch... >>> Regarding tracking whether we disable the napi of peer veth when >>> unloading >>> the veth's xdp program, this can actually be handled cleanly. >>> >>> We need to note that when peer veth enable GRO, the peer veth device will >>> update the `dev->wanted_features` with NETIF_F_GRO of peer veth (refer to >>> __netdev_update_features() and veth_set_features() ). >>> >>> When veth loads the xdp program and the napi of peer veth is not ready >>> (that is, peer veth does not load the xdp program or has no enable gro), >>> at this time, we can choose `ethtool -K veth0 gro on` to enable the >>> napi of >>> peer veth, this command also makes the peer veth device update their >>> wanted_features, or choose we automatically enable napi for peer veth. >>> >>> If we want to unload the xdp program for veth, peer veth cannot directly >>> disable its napi, because we need to judge whether peer veth is >>> gro_requested >>> ( ref to veth_gro_requested() ) or has its priv->_xdp_prog, if so, just >>> clean veth's xdp environment and disable the napi of veth instead of >>> directly disable the napi of peer veth, because of the existence of the >>> gro_requested and the xdp program loading on peer veth. >>> >>> But, if peer veth does not have gro_requested or xdp_program loading >>> on itself, >>> then we can directly disable the napi of peer veth. >> Hi, Toke. Do you think the above solution is effective for the problem >> of veth >> xdp_rediect dropping packets? Or is there more to add? If the above solution >> seems to be no problem for the time being, I'm ready to start with this idea >> and try to make the corresponding patch. > I think it might? Please write a patch and we can have a look :) Okay, I will do it right away. Thanks. > > -Toke
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 466da01..e1f5561 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	/* The napi pointer is set if NAPI is enabled, which ensures that
 	 * xdp_ring is initialized on receive side and the peer device is up.
 	 */
-	if (!rcu_access_pointer(rq->napi))
+	if (!rcu_access_pointer(rq->napi)) {
+		for (i = 0; i < n; i++) {
+			struct xdp_frame *xdpf = frames[i];
+			struct netdev_queue *txq = NULL;
+			struct sk_buff *skb;
+			int queue_mapping;
+			u16 mac_len;
+
+			skb = xdp_build_skb_from_frame(xdpf, dev);
+			if (unlikely(!skb)) {
+				ret = nxmit;
+				goto out;
+			}
+
+			/* We need to restore ETH header, because it is pulled
+			 * in eth_type_trans.
+			 */
+			mac_len = skb->data - skb_mac_header(skb);
+			skb_push(skb, mac_len);
+
+			nxmit++;
+
+			queue_mapping = skb_get_queue_mapping(skb);
+			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
+			__netif_tx_lock(txq, smp_processor_id());
+			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
+				__netif_tx_unlock(txq);
+				ret = nxmit;
+				goto out;
+			}
+			__netif_tx_unlock(txq);
+		}
+
+		ret = nxmit;
 		goto out;
+	}
 
 	max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;