Message ID | 20230111130520.483222-1-dnlplm@gmail.com (mailing list archive) |
---|---|
Series | add tx packets aggregation to ethtool and rmnet |
On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > Hello maintainers and all, > > this patchset implements tx qmap packets aggregation in rmnet and generic > ethtool support for that. > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > allowed throughput both in tx and rx during a bidirectional test if tx packets > aggregation is not enabled. One question I would have about this is if you are making use of Byte Queue Limits and netdev_xmit_more at all? I know for high speed devices most of us added support for xmit_more because PCIe bandwidth was limiting when we were writing the Tx ring indexes/doorbells with every packet. To overcome that we added netdev_xmit_more which told us when the Qdisc had more packets to give us. This allowed us to better utilize the PCIe bus by bursting packets through without adding additional latency. > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > The same scenario with tx aggregation enabled is pictured at > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > showing a regular graph. > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > does not happen at the throughputs I'm able to test currently: maybe the same > could happen when moving close to the maximum rates supported by those modems. > Anyway, having the tx aggregation enabled should not hurt. > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > However, it turned out that rmnet was a better candidate for the implementation. > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > not sure if I got their advice right, but this patchset add also generic ethtool > support for tx aggregation. I have concerns about this essentially moving queueing disciplines down into the device. The idea of doing Tx aggregation seems like something that should be done with the qdisc rather than the device driver. Otherwise we are looking at having multiple implementations of this aggregation floating around. It seems like it would make more sense to have this batching happen at the qdisc layer, and then the qdisc layer would pass down a batch of frames w/ xmit_more set to indicate it is flushing the batch.
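To illustrate the xmit_more pattern Alexander describes, here is a minimal sketch of how a driver's ndo_start_xmit could defer its hardware kick (doorbell write, or URB submission on a USB modem) until the stack signals that no more packets are immediately pending. The `example_*` names and the `struct example_priv` state are hypothetical; only `netdev_xmit_more()`, `netdev_priv()` and the netdev_tx_t return codes are real kernel APIs. This is a sketch of the general technique, not code from rmnet or any existing driver.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct example_priv;	/* hypothetical per-device state */
static void example_post_frame(struct example_priv *priv, struct sk_buff *skb);
static bool example_has_room(struct example_priv *priv);
static void example_kick_hw(struct example_priv *priv);

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);

	example_post_frame(priv, skb);	/* queue the frame, no flush yet */

	/*
	 * Kick the hardware only when the qdisc has nothing more queued
	 * for us right now, or when our buffer/ring is full anyway.
	 */
	if (!netdev_xmit_more() || !example_has_room(priv))
		example_kick_hw(priv);

	return NETDEV_TX_OK;
}
```

The effect is that bursts handed down by the qdisc (for example the segments of a GSO super-frame) are flushed to the device once per burst rather than once per packet, without the driver adding any timer-based latency of its own.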
Hello Alexander, Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck <alexander.duyck@gmail.com> ha scritto: > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > Hello maintainers and all, > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > ethtool support for that. > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > aggregation is not enabled. > > One question I would have about this is if you are making use of Byte > Queue Limits and netdev_xmit_more at all? I know for high speed devices > most of us added support for xmit_more because PCIe bandwidth was > limiting when we were writing the Tx ring indexes/doorbells with every > packet. To overcome that we added netdev_xmit_more which told us when > the Qdisc had more packets to give us. This allowed us to better > utilize the PCIe bus by bursting packets through without adding > additional latency. > no, I was not aware of BQL: this development has been basically modelled on what other mobile broadband drivers do (e.g. cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are not using BQL. If I understand properly documentation rmnet0/queues/tx-0/byte_queue_limits/limit_max would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work in combination with the bytes limit: at least the first one is mandatory, since the modem can't receive more than a certain number (this is a variable value depending on the modem model and is collected through userspace tools). ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine that tx aggregation has been enabled by the userspace tool managing the qmi requests, otherwise no aggregation should be performed. > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > The same scenario with tx aggregation enabled is pictured at > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > showing a regular graph. > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > does not happen at the throughputs I'm able to test currently: maybe the same > > could happen when moving close to the maximum rates supported by those modems. > > Anyway, having the tx aggregation enabled should not hurt. > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > not sure if I got their advice right, but this patchset add also generic ethtool > > support for tx aggregation. > > I have concerns about this essentially moving queueing disciplines down > into the device. 
> The idea of doing Tx aggregation seems like something > that should be done with the qdisc rather than the device driver. > Otherwise we are looking at having multiple implementations of this > aggregation floating around. It seems like it would make more sense to > have this batching happen at the qdisc layer, and then the qdisc layer > would pass down a batch of frames w/ xmit_more set to indicate it is > flushing the batch. Honestly, I'm not expert enough to give a reliable opinion about this. I feel these settings are more closely tied to the hardware, since they also require a corresponding configuration on the hardware itself, done by the user, so ethtool would seem to me a good choice, but I may be biased since I did this development :-) Thanks, Daniele
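As an illustration of how the three coalesce knobs discussed in this exchange typically interact, the sketch below shows a generic byte/frame/timer-bounded aggregation path: packets are appended to an aggregation buffer, and the buffer is flushed to the modem when the byte budget, the frame budget, or the timer expires. This is not the code from this patchset; the `agg_*` structure and helpers (`agg_flush()`, `agg_append()`, `agg_send_single()`) are hypothetical, and only the ETHTOOL_A_COALESCE_TX_AGGR_* names referenced in the comments come from the series.

```c
#include <linux/hrtimer.h>
#include <linux/skbuff.h>

struct agg_state {
	struct sk_buff *agg_skb;  /* accumulated aggregate frame */
	u32 agg_len;              /* bytes collected so far */
	u32 agg_count;            /* packets collected so far */
	u32 max_bytes;            /* cf. ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES */
	u32 max_frames;           /* cf. ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES */
	ktime_t timeout;          /* cf. ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS */
	struct hrtimer timer;     /* flushes the buffer on expiry */
};

static void agg_flush(struct agg_state *s);  /* sends agg_skb, resets len/count, cancels timer */
static void agg_append(struct agg_state *s, struct sk_buff *skb);
static void agg_send_single(struct agg_state *s, struct sk_buff *skb);

static void agg_tx_one(struct agg_state *s, struct sk_buff *skb)
{
	if (!s->max_frames) {               /* aggregation disabled */
		agg_send_single(s, skb);
		return;
	}

	if (s->agg_len + skb->len > s->max_bytes)
		agg_flush(s);               /* would blow the byte budget */

	agg_append(s, skb);
	s->agg_len += skb->len;

	if (++s->agg_count >= s->max_frames)
		agg_flush(s);               /* frame budget reached */
	else if (!hrtimer_active(&s->timer))
		hrtimer_start(&s->timer, s->timeout, HRTIMER_MODE_REL);
}
```

The timer is what Alexander objects to below: it trades added latency for larger aggregates, whereas the xmit_more approach only batches what the stack already has queued.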
On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote: > > Hello Alexander, > > Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck > <alexander.duyck@gmail.com> ha scritto: > > > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > > Hello maintainers and all, > > > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > > ethtool support for that. > > > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > > aggregation is not enabled. > > > > One question I would have about this is if you are making use of Byte > > Queue Limits and netdev_xmit_more at all? I know for high speed devices > > most of us added support for xmit_more because PCIe bandwidth was > > limiting when we were writing the Tx ring indexes/doorbells with every > > packet. To overcome that we added netdev_xmit_more which told us when > > the Qdisc had more packets to give us. This allowed us to better > > utilize the PCIe bus by bursting packets through without adding > > additional latency. > > > > no, I was not aware of BQL: this development has been basically > modelled on what other mobile broadband drivers do (e.g. > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are > not using BQL. > > If I understand properly documentation > > rmnet0/queues/tx-0/byte_queue_limits/limit_max > > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. Yes the general idea is that you end up targeting the upper limit for how many frames can be sent in a single burst. > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work > in combination with the bytes limit: at least the first one is > mandatory, since the modem can't receive more than a certain number > (this is a variable value depending on the modem model and is > collected through userspace tools). In terms of MAX_FRAMES there isn't necessarily anything like that, but at the same time it isn't something that is already controlled by the netdev itself by using the netif_stop_queue or netif_stop_subqueue when there isn't space to store another frame. As such most devices control this by just manipulating their descriptor ring size via "ethtool -G <dev> tx <N>" As far as the TIME_USECS that is something I tried to propose a decade ago and was essentially given a hard "NAK" before xmit_more was introduced. We shouldn't be adding latency when we don't need to. Between GSO and xmit_more you should find that the network stack itself will naturally want to give you larger bursts of frames with xmit_more set. In addition, adding latency can mess with certain TCP algorithms and cause problematic behaviors. > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > that tx aggregation has been enabled by the userspace tool managing > the qmi requests, otherwise no aggregation should be performed. Is there a specific reason why you wouldn't want to take advantage of aggregation that is already provided by the stack in the form of things such as GSO and the qdisc layer? I know most of the high speed NICs are always making use of xmit_more since things like GSO can take advantage of it to increase the throughput. Enabling BQL is a way of taking that one step further and enabling the non-GSO cases. 
> > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > > > The same scenario with tx aggregation enabled is pictured at > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > > showing a regular graph. > > > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > > does not happen at the throughputs I'm able to test currently: maybe the same > > > could happen when moving close to the maximum rates supported by those modems. > > > Anyway, having the tx aggregation enabled should not hurt. > > > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > > not sure if I got their advice right, but this patchset add also generic ethtool > > > support for tx aggregation. > > > > I have concerns about this essentially moving queueing disciplines down > > into the device. The idea of doing Tx aggregation seems like something > > that should be done with the qdisc rather than the device driver. > > Otherwise we are looking at having multiple implementations of this > > aggregation floating around. It seems like it would make more sense to > > have this batching happen at the qdisc layer, and then the qdisc layer > > would pass down a batch of frames w/ xmit_more set to indicate it is > > flushing the batch. > > Honestly, I'm not expert enough to give a reliable opinion about this. > > I feel like these settings are more related to the hardware, requiring > also a configuration on the hardware itself done by the user, so > ethtool would seem to me a good choice, but I may be biased since I > did this development :-) Yeah, I get that. I went through something similar when I had submitted a patch to defer Tx tail writes in order to try and improve the PCIe throughput. I would be open to revisiting this if we gave xmit_more and BQL a try and it doesn't take care of this, but from my past experience odds are the combination will likely resolve most of what you are seeing without adding additional latency. At a minimum the xmit_more should show a significant improvement in Tx throughput just from the benefits of GSO sending bursts of frames through that can easily be batched.
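The flow-control mechanism Alexander refers to (netif_stop_queue/netif_stop_subqueue instead of a hard frame-count parameter) can be sketched as follows. The `modem_*` names and `struct modem_priv` are hypothetical; `netif_stop_queue()`, `netif_wake_queue()` and `netif_queue_stopped()` are the real kernel primitives. The idea is that the driver never needs a configured per-burst frame cap: it simply stops the queue when it has no room for another frame and wakes it again from the completion path.

```c
#include <linux/netdevice.h>

struct modem_priv { struct net_device *dev; /* plus hypothetical ring state */ };
static void modem_post_frame(struct modem_priv *mp, struct sk_buff *skb);
static unsigned int modem_free_slots(struct modem_priv *mp);

static netdev_tx_t modem_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct modem_priv *mp = netdev_priv(dev);

	modem_post_frame(mp, skb);

	/* No space for a further frame: tell the stack to stop feeding us. */
	if (!modem_free_slots(mp))
		netif_stop_queue(dev);

	return NETDEV_TX_OK;
}

/* Runs from the tx completion path (e.g. a USB complete callback). */
static void modem_tx_done(struct modem_priv *mp)
{
	if (netif_queue_stopped(mp->dev) && modem_free_slots(mp))
		netif_wake_queue(mp->dev);
}
```

Sizing the "ring" that backs modem_free_slots() is then done with `ethtool -G`, as Alexander notes, rather than through a coalesce parameter.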
Dear Alexander: Thank you for taking a look at this thread. In earlier tests, the aggregation was indeed an improvement over the default behavior, but the amount of data that ended up living in the device was best case, 150ms, and worst case, 25 seconds, in the flent tcp_nup case. A better test would be the staggered start tcp_4up_squarewave test which would show only one tcp out of the 4, able to start with the current overall structure of this portion of the stack, once that buffer gets filled. The shiny new rust based Crusader test ( https://github.com/Zoxc/crusader ) has also now got a nice staggered start test that shows this problem in more detail. I was so peeved about 4g/5g behaviors like this in general that I was going to dedicate a blog entry to it; this recent rant only touches upon the real problem with engineering to the test, here: https://blog.cerowrt.org/post/speedtests/ Unfortunately your post is more about leveraging xmit_more - which is good, but the BQL structure for ethernet, itself leverages the concept of a completion interrupt (when the packets actually hit the air or wire), which I dearly wish was some previously unknown, single line hack to the driver or message in the usb API for 4g/5g etc. On Fri, Jan 13, 2023 at 8:17 AM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote: > > > > Hello Alexander, > > > > Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck > > <alexander.duyck@gmail.com> ha scritto: > > > > > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > > > Hello maintainers and all, > > > > > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > > > ethtool support for that. > > > > > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > > > aggregation is not enabled. > > > > > > One question I would have about this is if you are making use of Byte > > > Queue Limits and netdev_xmit_more at all? I know for high speed devices > > > most of us added support for xmit_more because PCIe bandwidth was > > > limiting when we were writing the Tx ring indexes/doorbells with every > > > packet. To overcome that we added netdev_xmit_more which told us when > > > the Qdisc had more packets to give us. This allowed us to better > > > utilize the PCIe bus by bursting packets through without adding > > > additional latency. > > > > > > > no, I was not aware of BQL: this development has been basically > > modelled on what other mobile broadband drivers do (e.g. > > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are > > not using BQL. > > > > If I understand properly documentation > > > > rmnet0/queues/tx-0/byte_queue_limits/limit_max > > > > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. > > Yes the general idea is that you end up targeting the upper limit for > how many frames can be sent in a single burst. > > > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES > > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work > > in combination with the bytes limit: at least the first one is > > mandatory, since the modem can't receive more than a certain number > > (this is a variable value depending on the modem model and is > > collected through userspace tools). 
> > In terms of MAX_FRAMES there isn't necessarily anything like that, but > at the same time it isn't something that is already controlled by the > netdev itself by using the netif_stop_queue or netif_stop_subqueue > when there isn't space to store another frame. As such most devices > control this by just manipulating their descriptor ring size via > "ethtool -G <dev> tx <N>" > > As far as the TIME_USECS that is something I tried to propose a decade > ago and was essentially given a hard "NAK" before xmit_more was > introduced. We shouldn't be adding latency when we don't need to. > Between GSO and xmit_more you should find that the network stack > itself will naturally want to give you larger bursts of frames with > xmit_more set. In addition, adding latency can mess with certain TCP > algorithms and cause problematic behaviors. > > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > that tx aggregation has been enabled by the userspace tool managing > > the qmi requests, otherwise no aggregation should be performed. > > Is there a specific reason why you wouldn't want to take advantage of > aggregation that is already provided by the stack in the form of > things such as GSO and the qdisc layer? I know most of the high speed > NICs are always making use of xmit_more since things like GSO can take > advantage of it to increase the throughput. Enabling BQL is a way of > taking that one step further and enabling the non-GSO cases. > > > > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > > > > > The same scenario with tx aggregation enabled is pictured at > > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > > > showing a regular graph. > > > > > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > > > does not happen at the throughputs I'm able to test currently: maybe the same > > > > could happen when moving close to the maximum rates supported by those modems. > > > > Anyway, having the tx aggregation enabled should not hurt. > > > > > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > > > not sure if I got their advice right, but this patchset add also generic ethtool > > > > support for tx aggregation. > > > > > > I have concerns about this essentially moving queueing disciplines down > > > into the device. The idea of doing Tx aggregation seems like something > > > that should be done with the qdisc rather than the device driver. > > > Otherwise we are looking at having multiple implementations of this > > > aggregation floating around. 
It seems like it would make more sense to > > > have this batching happen at the qdisc layer, and then the qdisc layer > > > would pass down a batch of frames w/ xmit_more set to indicate it is > > > flushing the batch. > > > > Honestly, I'm not expert enough to give a reliable opinion about this. > > > > I feel like these settings are more related to the hardware, requiring > > also a configuration on the hardware itself done by the user, so > > ethtool would seem to me a good choice, but I may be biased since I > > did this development :-) > > Yeah, I get that. I went through something similar when I had > submitted a patch to defer Tx tail writes in order to try and improve > the PCIe throughput. I would be open to revisiting this if we gave > xmit_more and BQL a try and it doesn't take care of this, but from my > past experience odds are the combination will likely resolve most of > what you are seeing without adding additional latency. At a minimum > the xmit_more should show a significant improvement in Tx throughput > just from the benefits of GSO sending bursts of frames through that > can easily be batched.
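The BQL hook-up Dave is describing boils down to two accounting calls: the driver reports bytes handed to the device at transmit time and bytes actually completed from the completion interrupt (or USB complete callback), and BQL auto-tunes how much data may sit in the device at once. In the sketch below, `modem_bql_*` and `modem_submit_to_hw()` are hypothetical, while `netdev_sent_queue()` and `netdev_completed_queue()` are the real BQL primitives.

```c
#include <linux/netdevice.h>

static void modem_submit_to_hw(void *priv, struct sk_buff *skb);  /* hypothetical */

static netdev_tx_t modem_bql_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int len = skb->len;

	modem_submit_to_hw(netdev_priv(dev), skb);
	netdev_sent_queue(dev, len);	/* bytes now pending in the device */

	return NETDEV_TX_OK;
}

/* Called when the hardware reports that frames have actually left the device. */
static void modem_bql_tx_done(struct net_device *dev,
			      unsigned int pkts, unsigned int bytes)
{
	/*
	 * BQL stops the stack from queuing more data once its computed
	 * limit is reached and restarts it from here as bytes complete.
	 */
	netdev_completed_queue(dev, pkts, bytes);
}
```

The catch Dave points out is that this only works if the device gives the driver a meaningful completion signal, i.e. one tied to when packets actually hit the air or wire, which is exactly what is hard to get from many 4G/5G modem interfaces.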
On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote: > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > that tx aggregation has been enabled by the userspace tool managing > > the qmi requests, otherwise no aggregation should be performed. > > Is there a specific reason why you wouldn't want to take advantage of > aggregation that is already provided by the stack in the form of > things such as GSO and the qdisc layer? I know most of the high speed > NICs are always making use of xmit_more since things like GSO can take > advantage of it to increase the throughput. Enabling BQL is a way of > taking that one step further and enabling the non-GSO cases. The patches had been applied last night by DaveM but FWIW I think Alex's idea is quite interesting. Even without BQL I believe you'd get xmit_more set within a single GSO super-frame. TCP sends data in chunks of 64kB, and you're only aggregating 32kB. If we could get the same effect without any added latency and user configuration that'd be great.
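Jakub's point about xmit_more within a single GSO super-frame can be seen in how the core stack walks a segment list: every segment except the last is handed to the driver with the "more" hint set, which the driver then observes as netdev_xmit_more(). The sketch below is a simplified paraphrase of that logic (the function name is hypothetical, and it is not a verbatim copy of net/core/dev.c); `netdev_start_xmit()` and `skb_mark_not_on_list()` are real APIs.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_xmit_segments(struct sk_buff *first,
					 struct net_device *dev,
					 struct netdev_queue *txq)
{
	struct sk_buff *skb = first;
	netdev_tx_t rc = NETDEV_TX_OK;

	while (skb) {
		struct sk_buff *next = skb->next;

		skb_mark_not_on_list(skb);
		/* "more" is true whenever another segment follows. */
		rc = netdev_start_xmit(skb, dev, txq, !!next);
		if (rc != NETDEV_TX_OK)
			break;

		skb = next;
	}

	return rc;
}
```

So with a 64kB TCP GSO chunk segmented into MTU-sized frames, a driver that honors the hint already receives a natural burst to aggregate, without any configured byte/frame limits or flush timer.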
Hello: This series was applied to netdev/net-next.git (master) by David S. Miller <davem@davemloft.net>: On Wed, 11 Jan 2023 14:05:17 +0100 you wrote: > Hello maintainers and all, > > this patchset implements tx qmap packets aggregation in rmnet and generic > ethtool support for that. > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > allowed throughput both in tx and rx during a bidirectional test if tx packets > aggregation is not enabled. > > [...] Here is the summary with links: - [net-next,v4,1/3] ethtool: add tx aggregation parameters https://git.kernel.org/netdev/net-next/c/31de2842399a - [net-next,v4,2/3] net: qualcomm: rmnet: add tx packets aggregation https://git.kernel.org/netdev/net-next/c/64b5d1f8f2d1 - [net-next,v4,3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation https://git.kernel.org/netdev/net-next/c/db8a563a9d90 You are awesome, thank you!
Hello Jakub and Alexander, Il giorno ven 13 gen 2023 alle ore 20:48 Jakub Kicinski <kuba@kernel.org> ha scritto: > > On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote: > > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > > that tx aggregation has been enabled by the userspace tool managing > > > the qmi requests, otherwise no aggregation should be performed. > > > > Is there a specific reason why you wouldn't want to take advantage of > > aggregation that is already provided by the stack in the form of > > things such as GSO and the qdisc layer? I know most of the high speed > > NICs are always making use of xmit_more since things like GSO can take > > advantage of it to increase the throughput. Enabling BQL is a way of > > taking that one step further and enabling the non-GSO cases. > > The patches had been applied last night by DaveM but FWIW I think > Alex's idea is quite interesting. Even without BQL I believe you'd > get xmit_more set within a single GSO super-frame. TCP sends data > in chunks of 64kB, and you're only aggregating 32kB. If we could > get the same effect without any added latency and user configuration > that'd be great. Thanks for the hints, I'll explore xmit_more usage and try to gather some numbers to compare the two solutions. Regards, Daniele