Message ID | 20230111130520.483222-1-dnlplm@gmail.com (mailing list archive) |
---|---|
Series | add tx packets aggregation to ethtool and rmnet |
On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > Hello maintainers and all, > > this patchset implements tx qmap packets aggregation in rmnet and generic > ethtool support for that. > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > allowed throughput both in tx and rx during a bidirectional test if tx packets > aggregation is not enabled. One question I would have about this is if you are making use of Byte Queue Limits and netdev_xmit_more at all? I know for high speed devices most of us added support for xmit_more because PCIe bandwidth was limiting when we were writing the Tx ring indexes/doorbells with every packet. To overcome that we added netdev_xmit_more which told us when the Qdisc had more packets to give us. This allowed us to better utilize the PCIe bus by bursting packets through without adding additional latency. > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > The same scenario with tx aggregation enabled is pictured at > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > showing a regular graph. > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > does not happen at the throughputs I'm able to test currently: maybe the same > could happen when moving close to the maximum rates supported by those modems. > Anyway, having the tx aggregation enabled should not hurt. > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > However, it turned out that rmnet was a better candidate for the implementation. > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > not sure if I got their advice right, but this patchset add also generic ethtool > support for tx aggregation. I have concerns about this essentially moving queueing disciplines down into the device. The idea of doing Tx aggregation seems like something that should be done with the qdisc rather than the device driver. Otherwise we are looking at having multiple implementations of this aggregation floating around. It seems like it would make more sense to have this batching happen at the qdisc layer, and then the qdisc layer would pass down a batch of frames w/ xmit_more set to indicate it is flushing the batch.
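To illustrate the xmit_more pattern Alexander describes, here is a minimal sketch of how a driver's ndo_start_xmit could defer its hardware kick (doorbell write, or URB submission on a USB modem) until the stack signals that no more packets are immediately pending. The `example_*` names and the `struct example_priv` state are hypothetical; only `netdev_xmit_more()`, `netdev_priv()` and the netdev_tx_t return codes are real kernel APIs. This is a sketch of the general technique, not code from rmnet or any existing driver.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct example_priv;	/* hypothetical per-device state */
static void example_post_frame(struct example_priv *priv, struct sk_buff *skb);
static bool example_has_room(struct example_priv *priv);
static void example_kick_hw(struct example_priv *priv);

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);

	example_post_frame(priv, skb);	/* queue the frame, no flush yet */

	/*
	 * Kick the hardware only when the qdisc has nothing more queued
	 * for us right now, or when our buffer/ring is full anyway.
	 */
	if (!netdev_xmit_more() || !example_has_room(priv))
		example_kick_hw(priv);

	return NETDEV_TX_OK;
}
```

The effect is that bursts handed down by the qdisc (for example the segments of a GSO super-frame) are flushed to the device once per burst rather than once per packet, without the driver adding any timer-based latency of its own.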
Hello Alexander, Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck <alexander.duyck@gmail.com> ha scritto: > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > Hello maintainers and all, > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > ethtool support for that. > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > aggregation is not enabled. > > One question I would have about this is if you are making use of Byte > Queue Limits and netdev_xmit_more at all? I know for high speed devices > most of us added support for xmit_more because PCIe bandwidth was > limiting when we were writing the Tx ring indexes/doorbells with every > packet. To overcome that we added netdev_xmit_more which told us when > the Qdisc had more packets to give us. This allowed us to better > utilize the PCIe bus by bursting packets through without adding > additional latency. > no, I was not aware of BQL: this development has been basically modelled on what other mobile broadband drivers do (e.g. cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are not using BQL. If I understand properly documentation rmnet0/queues/tx-0/byte_queue_limits/limit_max would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work in combination with the bytes limit: at least the first one is mandatory, since the modem can't receive more than a certain number (this is a variable value depending on the modem model and is collected through userspace tools). ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine that tx aggregation has been enabled by the userspace tool managing the qmi requests, otherwise no aggregation should be performed. > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > The same scenario with tx aggregation enabled is pictured at > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > showing a regular graph. > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > does not happen at the throughputs I'm able to test currently: maybe the same > > could happen when moving close to the maximum rates supported by those modems. > > Anyway, having the tx aggregation enabled should not hurt. > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > not sure if I got their advice right, but this patchset add also generic ethtool > > support for tx aggregation. > > I have concerns about this essentially moving queueing disciplines down > into the device. 
> The idea of doing Tx aggregation seems like something > that should be done with the qdisc rather than the device driver. > Otherwise we are looking at having multiple implementations of this > aggregation floating around. It seems like it would make more sense to > have this batching happen at the qdisc layer, and then the qdisc layer > would pass down a batch of frames w/ xmit_more set to indicate it is > flushing the batch. Honestly, I'm not expert enough to give a reliable opinion about this. I feel these settings are more closely tied to the hardware, since they also require a corresponding configuration on the hardware itself, done by the user, so ethtool would seem to me a good choice, but I may be biased since I did this development :-) Thanks, Daniele
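As an illustration of how the three coalesce knobs discussed in this exchange typically interact, the sketch below shows a generic byte/frame/timer-bounded aggregation path: packets are appended to an aggregation buffer, and the buffer is flushed to the modem when the byte budget, the frame budget, or the timer expires. This is not the code from this patchset; the `agg_*` structure and helpers (`agg_flush()`, `agg_append()`, `agg_send_single()`) are hypothetical, and only the ETHTOOL_A_COALESCE_TX_AGGR_* names referenced in the comments come from the series.

```c
#include <linux/hrtimer.h>
#include <linux/skbuff.h>

struct agg_state {
	struct sk_buff *agg_skb;  /* accumulated aggregate frame */
	u32 agg_len;              /* bytes collected so far */
	u32 agg_count;            /* packets collected so far */
	u32 max_bytes;            /* cf. ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES */
	u32 max_frames;           /* cf. ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES */
	ktime_t timeout;          /* cf. ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS */
	struct hrtimer timer;     /* flushes the buffer on expiry */
};

static void agg_flush(struct agg_state *s);  /* sends agg_skb, resets len/count, cancels timer */
static void agg_append(struct agg_state *s, struct sk_buff *skb);
static void agg_send_single(struct agg_state *s, struct sk_buff *skb);

static void agg_tx_one(struct agg_state *s, struct sk_buff *skb)
{
	if (!s->max_frames) {               /* aggregation disabled */
		agg_send_single(s, skb);
		return;
	}

	if (s->agg_len + skb->len > s->max_bytes)
		agg_flush(s);               /* would blow the byte budget */

	agg_append(s, skb);
	s->agg_len += skb->len;

	if (++s->agg_count >= s->max_frames)
		agg_flush(s);               /* frame budget reached */
	else if (!hrtimer_active(&s->timer))
		hrtimer_start(&s->timer, s->timeout, HRTIMER_MODE_REL);
}
```

The timer is what Alexander objects to below: it trades added latency for larger aggregates, whereas the xmit_more approach only batches what the stack already has queued.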
On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote: > > Hello Alexander, > > Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck > <alexander.duyck@gmail.com> ha scritto: > > > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > > Hello maintainers and all, > > > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > > ethtool support for that. > > > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > > aggregation is not enabled. > > > > One question I would have about this is if you are making use of Byte > > Queue Limits and netdev_xmit_more at all? I know for high speed devices > > most of us added support for xmit_more because PCIe bandwidth was > > limiting when we were writing the Tx ring indexes/doorbells with every > > packet. To overcome that we added netdev_xmit_more which told us when > > the Qdisc had more packets to give us. This allowed us to better > > utilize the PCIe bus by bursting packets through without adding > > additional latency. > > > > no, I was not aware of BQL: this development has been basically > modelled on what other mobile broadband drivers do (e.g. > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are > not using BQL. > > If I understand properly documentation > > rmnet0/queues/tx-0/byte_queue_limits/limit_max > > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. Yes the general idea is that you end up targeting the upper limit for how many frames can be sent in a single burst. > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work > in combination with the bytes limit: at least the first one is > mandatory, since the modem can't receive more than a certain number > (this is a variable value depending on the modem model and is > collected through userspace tools). In terms of MAX_FRAMES there isn't necessarily anything like that, but at the same time it isn't something that is already controlled by the netdev itself by using the netif_stop_queue or netif_stop_subqueue when there isn't space to store another frame. As such most devices control this by just manipulating their descriptor ring size via "ethtool -G <dev> tx <N>" As far as the TIME_USECS that is something I tried to propose a decade ago and was essentially given a hard "NAK" before xmit_more was introduced. We shouldn't be adding latency when we don't need to. Between GSO and xmit_more you should find that the network stack itself will naturally want to give you larger bursts of frames with xmit_more set. In addition, adding latency can mess with certain TCP algorithms and cause problematic behaviors. > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > that tx aggregation has been enabled by the userspace tool managing > the qmi requests, otherwise no aggregation should be performed. Is there a specific reason why you wouldn't want to take advantage of aggregation that is already provided by the stack in the form of things such as GSO and the qdisc layer? I know most of the high speed NICs are always making use of xmit_more since things like GSO can take advantage of it to increase the throughput. Enabling BQL is a way of taking that one step further and enabling the non-GSO cases. 
> > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > > > The same scenario with tx aggregation enabled is pictured at > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > > showing a regular graph. > > > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > > does not happen at the throughputs I'm able to test currently: maybe the same > > > could happen when moving close to the maximum rates supported by those modems. > > > Anyway, having the tx aggregation enabled should not hurt. > > > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > > not sure if I got their advice right, but this patchset add also generic ethtool > > > support for tx aggregation. > > > > I have concerns about this essentially moving queueing disciplines down > > into the device. The idea of doing Tx aggregation seems like something > > that should be done with the qdisc rather than the device driver. > > Otherwise we are looking at having multiple implementations of this > > aggregation floating around. It seems like it would make more sense to > > have this batching happen at the qdisc layer, and then the qdisc layer > > would pass down a batch of frames w/ xmit_more set to indicate it is > > flushing the batch. > > Honestly, I'm not expert enough to give a reliable opinion about this. > > I feel like these settings are more related to the hardware, requiring > also a configuration on the hardware itself done by the user, so > ethtool would seem to me a good choice, but I may be biased since I > did this development :-) Yeah, I get that. I went through something similar when I had submitted a patch to defer Tx tail writes in order to try and improve the PCIe throughput. I would be open to revisiting this if we gave xmit_more and BQL a try and it doesn't take care of this, but from my past experience odds are the combination will likely resolve most of what you are seeing without adding additional latency. At a minimum the xmit_more should show a significant improvement in Tx throughput just from the benefits of GSO sending bursts of frames through that can easily be batched.
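The flow-control mechanism Alexander refers to (netif_stop_queue/netif_stop_subqueue instead of a hard frame-count parameter) can be sketched as follows. The `modem_*` names and `struct modem_priv` are hypothetical; `netif_stop_queue()`, `netif_wake_queue()` and `netif_queue_stopped()` are the real kernel primitives. The idea is that the driver never needs a configured per-burst frame cap: it simply stops the queue when it has no room for another frame and wakes it again from the completion path.

```c
#include <linux/netdevice.h>

struct modem_priv { struct net_device *dev; /* plus hypothetical ring state */ };
static void modem_post_frame(struct modem_priv *mp, struct sk_buff *skb);
static unsigned int modem_free_slots(struct modem_priv *mp);

static netdev_tx_t modem_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct modem_priv *mp = netdev_priv(dev);

	modem_post_frame(mp, skb);

	/* No space for a further frame: tell the stack to stop feeding us. */
	if (!modem_free_slots(mp))
		netif_stop_queue(dev);

	return NETDEV_TX_OK;
}

/* Runs from the tx completion path (e.g. a USB complete callback). */
static void modem_tx_done(struct modem_priv *mp)
{
	if (netif_queue_stopped(mp->dev) && modem_free_slots(mp))
		netif_wake_queue(mp->dev);
}
```

Sizing the "ring" that backs modem_free_slots() is then done with `ethtool -G`, as Alexander notes, rather than through a coalesce parameter.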
Dear Alexander: Thank you for taking a look at this thread. In earlier tests, the aggregation was indeed an improvement over the default behavior, but the amount of data that ended up living in the device was best case, 150ms, and worst case, 25 seconds, in the flent tcp_nup case. A better test would be the staggered start tcp_4up_squarewave test which would show only one tcp out of the 4, able to start with the current overall structure of this portion of the stack, once that buffer gets filled. The shiny new rust based Crusader test ( https://github.com/Zoxc/crusader ) has also now got a nice staggered start test that shows this problem in more detail. I was so peeved about 4g/5g behaviors like this in general that I was going to dedicate a blog entry to it; this recent rant only touches upon the real problem with engineering to the test, here: https://blog.cerowrt.org/post/speedtests/ Unfortunately your post is more about leveraging xmit_more - which is good, but the BQL structure for ethernet, itself leverages the concept of a completion interrupt (when the packets actually hit the air or wire), which I dearly wish was some previously unknown, single line hack to the driver or message in the usb API for 4g/5g etc. On Fri, Jan 13, 2023 at 8:17 AM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Fri, Jan 13, 2023 at 7:50 AM Daniele Palmas <dnlplm@gmail.com> wrote: > > > > Hello Alexander, > > > > Il giorno ven 13 gen 2023 alle ore 00:00 Alexander H Duyck > > <alexander.duyck@gmail.com> ha scritto: > > > > > > On Wed, 2023-01-11 at 14:05 +0100, Daniele Palmas wrote: > > > > Hello maintainers and all, > > > > > > > > this patchset implements tx qmap packets aggregation in rmnet and generic > > > > ethtool support for that. > > > > > > > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > > > > allowed throughput both in tx and rx during a bidirectional test if tx packets > > > > aggregation is not enabled. > > > > > > One question I would have about this is if you are making use of Byte > > > Queue Limits and netdev_xmit_more at all? I know for high speed devices > > > most of us added support for xmit_more because PCIe bandwidth was > > > limiting when we were writing the Tx ring indexes/doorbells with every > > > packet. To overcome that we added netdev_xmit_more which told us when > > > the Qdisc had more packets to give us. This allowed us to better > > > utilize the PCIe bus by bursting packets through without adding > > > additional latency. > > > > > > > no, I was not aware of BQL: this development has been basically > > modelled on what other mobile broadband drivers do (e.g. > > cdc_mbim/cdc_ncm, Qualcomm downstream rmnet implementation), that are > > not using BQL. > > > > If I understand properly documentation > > > > rmnet0/queues/tx-0/byte_queue_limits/limit_max > > > > would be the candidate for replacing ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES. > > Yes the general idea is that you end up targeting the upper limit for > how many frames can be sent in a single burst. > > > But I can't find anything for ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES > > and ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS, something that should work > > in combination with the bytes limit: at least the first one is > > mandatory, since the modem can't receive more than a certain number > > (this is a variable value depending on the modem model and is > > collected through userspace tools). 
> > In terms of MAX_FRAMES there isn't necessarily anything like that, but > at the same time it isn't something that is already controlled by the > netdev itself by using the netif_stop_queue or netif_stop_subqueue > when there isn't space to store another frame. As such most devices > control this by just manipulating their descriptor ring size via > "ethtool -G <dev> tx <N>" > > As far as the TIME_USECS that is something I tried to propose a decade > ago and was essentially given a hard "NAK" before xmit_more was > introduced. We shouldn't be adding latency when we don't need to. > Between GSO and xmit_more you should find that the network stack > itself will naturally want to give you larger bursts of frames with > xmit_more set. In addition, adding latency can mess with certain TCP > algorithms and cause problematic behaviors. > > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > that tx aggregation has been enabled by the userspace tool managing > > the qmi requests, otherwise no aggregation should be performed. > > Is there a specific reason why you wouldn't want to take advantage of > aggregation that is already provided by the stack in the form of > things such as GSO and the qdisc layer? I know most of the high speed > NICs are always making use of xmit_more since things like GSO can take > advantage of it to increase the throughput. Enabling BQL is a way of > taking that one step further and enabling the non-GSO cases. > > > > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem > > > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at > > > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view > > > > > > > > Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue > > > > in tx and minor issues in rx (not able to reach max throughput). When there are concurrent > > > > tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present. > > > > > > > > The same scenario with tx aggregation enabled is pictured at > > > > https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view > > > > showing a regular graph. > > > > > > > > This issue does not happen with high-cat modems (e.g. SDX20), or at least it > > > > does not happen at the throughputs I'm able to test currently: maybe the same > > > > could happen when moving close to the maximum rates supported by those modems. > > > > Anyway, having the tx aggregation enabled should not hurt. > > > > > > > > The first attempt to solve this issue was in qmi_wwan qmap implementation, > > > > see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/ > > > > > > > > However, it turned out that rmnet was a better candidate for the implementation. > > > > > > > > Moreover, Greg and Jakub suggested also to use ethtool for the configuration: > > > > not sure if I got their advice right, but this patchset add also generic ethtool > > > > support for tx aggregation. > > > > > > I have concerns about this essentially moving queueing disciplines down > > > into the device. The idea of doing Tx aggregation seems like something > > > that should be done with the qdisc rather than the device driver. > > > Otherwise we are looking at having multiple implementations of this > > > aggregation floating around. 
It seems like it would make more sense to > > > have this batching happen at the qdisc layer, and then the qdisc layer > > > would pass down a batch of frames w/ xmit_more set to indicate it is > > > flushing the batch. > > > > Honestly, I'm not expert enough to give a reliable opinion about this. > > > > I feel like these settings are more related to the hardware, requiring > > also a configuration on the hardware itself done by the user, so > > ethtool would seem to me a good choice, but I may be biased since I > > did this development :-) > > Yeah, I get that. I went through something similar when I had > submitted a patch to defer Tx tail writes in order to try and improve > the PCIe throughput. I would be open to revisiting this if we gave > xmit_more and BQL a try and it doesn't take care of this, but from my > past experience odds are the combination will likely resolve most of > what you are seeing without adding additional latency. At a minimum > the xmit_more should show a significant improvement in Tx throughput > just from the benefits of GSO sending bursts of frames through that > can easily be batched.
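The BQL hook-up Dave is describing boils down to two accounting calls: the driver reports bytes handed to the device at transmit time and bytes actually completed from the completion interrupt (or USB complete callback), and BQL auto-tunes how much data may sit in the device at once. In the sketch below, `modem_bql_*` and `modem_submit_to_hw()` are hypothetical, while `netdev_sent_queue()` and `netdev_completed_queue()` are the real BQL primitives.

```c
#include <linux/netdevice.h>

static void modem_submit_to_hw(void *priv, struct sk_buff *skb);  /* hypothetical */

static netdev_tx_t modem_bql_xmit(struct sk_buff *skb, struct net_device *dev)
{
	unsigned int len = skb->len;

	modem_submit_to_hw(netdev_priv(dev), skb);
	netdev_sent_queue(dev, len);	/* bytes now pending in the device */

	return NETDEV_TX_OK;
}

/* Called when the hardware reports that frames have actually left the device. */
static void modem_bql_tx_done(struct net_device *dev,
			      unsigned int pkts, unsigned int bytes)
{
	/*
	 * BQL stops the stack from queuing more data once its computed
	 * limit is reached and restarts it from here as bytes complete.
	 */
	netdev_completed_queue(dev, pkts, bytes);
}
```

The catch Dave points out is that this only works if the device gives the driver a meaningful completion signal, i.e. one tied to when packets actually hit the air or wire, which is exactly what is hard to get from many 4G/5G modem interfaces.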
On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote: > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > that tx aggregation has been enabled by the userspace tool managing > > the qmi requests, otherwise no aggregation should be performed. > > Is there a specific reason why you wouldn't want to take advantage of > aggregation that is already provided by the stack in the form of > things such as GSO and the qdisc layer? I know most of the high speed > NICs are always making use of xmit_more since things like GSO can take > advantage of it to increase the throughput. Enabling BQL is a way of > taking that one step further and enabling the non-GSO cases. The patches had been applied last night by DaveM but FWIW I think Alex's idea is quite interesting. Even without BQL I believe you'd get xmit_more set within a single GSO super-frame. TCP sends data in chunks of 64kB, and you're only aggregating 32kB. If we could get the same effect without any added latency and user configuration that'd be great.
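Jakub's point about xmit_more within a single GSO super-frame can be seen in how the core stack walks a segment list: every segment except the last is handed to the driver with the "more" hint set, which the driver then observes as netdev_xmit_more(). The sketch below is a simplified paraphrase of that logic (the function name is hypothetical, and it is not a verbatim copy of net/core/dev.c); `netdev_start_xmit()` and `skb_mark_not_on_list()` are real APIs.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t example_xmit_segments(struct sk_buff *first,
					 struct net_device *dev,
					 struct netdev_queue *txq)
{
	struct sk_buff *skb = first;
	netdev_tx_t rc = NETDEV_TX_OK;

	while (skb) {
		struct sk_buff *next = skb->next;

		skb_mark_not_on_list(skb);
		/* "more" is true whenever another segment follows. */
		rc = netdev_start_xmit(skb, dev, txq, !!next);
		if (rc != NETDEV_TX_OK)
			break;

		skb = next;
	}

	return rc;
}
```

So with a 64kB TCP GSO chunk segmented into MTU-sized frames, a driver that honors the hint already receives a natural burst to aggregate, without any configured byte/frame limits or flush timer.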
Hello: This series was applied to netdev/net-next.git (master) by David S. Miller <davem@davemloft.net>: On Wed, 11 Jan 2023 14:05:17 +0100 you wrote: > Hello maintainers and all, > > this patchset implements tx qmap packets aggregation in rmnet and generic > ethtool support for that. > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum > allowed throughput both in tx and rx during a bidirectional test if tx packets > aggregation is not enabled. > > [...] Here is the summary with links: - [net-next,v4,1/3] ethtool: add tx aggregation parameters https://git.kernel.org/netdev/net-next/c/31de2842399a - [net-next,v4,2/3] net: qualcomm: rmnet: add tx packets aggregation https://git.kernel.org/netdev/net-next/c/64b5d1f8f2d1 - [net-next,v4,3/3] net: qualcomm: rmnet: add ethtool support for configuring tx aggregation https://git.kernel.org/netdev/net-next/c/db8a563a9d90 You are awesome, thank you!
Hello Jakub and Alexander, Il giorno ven 13 gen 2023 alle ore 20:48 Jakub Kicinski <kuba@kernel.org> ha scritto: > > On Fri, 13 Jan 2023 08:16:48 -0800 Alexander Duyck wrote: > > > ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES works also as a way to determine > > > that tx aggregation has been enabled by the userspace tool managing > > > the qmi requests, otherwise no aggregation should be performed. > > > > Is there a specific reason why you wouldn't want to take advantage of > > aggregation that is already provided by the stack in the form of > > things such as GSO and the qdisc layer? I know most of the high speed > > NICs are always making use of xmit_more since things like GSO can take > > advantage of it to increase the throughput. Enabling BQL is a way of > > taking that one step further and enabling the non-GSO cases. > > The patches had been applied last night by DaveM but FWIW I think > Alex's idea is quite interesting. Even without BQL I believe you'd > get xmit_more set within a single GSO super-frame. TCP sends data > in chunks of 64kB, and you're only aggregating 32kB. If we could > get the same effect without any added latency and user configuration > that'd be great. Thanks for the hints, I'll explore xmit_more usage and try to gather some numbers to compare the two solutions. Regards, Daniele