diff mbox series

[net-next,1/3] ethtool: add tx aggregation parameters

Message ID 20221109180249.4721-2-dnlplm@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series add tx packets aggregation to ethtool and rmnet | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1896 this patch: 1896
netdev/cc_maintainers warning 6 maintainers not CCed: huangguangbin2@huawei.com chenhao288@hisilicon.com linux@rempel-privat.de linux-doc@vger.kernel.org moshet@nvidia.com andrew@lunn.ch
netdev/build_clang success Errors and warnings before: 605 this patch: 605
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1997 this patch: 1997
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 103 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Daniele Palmas Nov. 9, 2022, 6:02 p.m. UTC
Add the following ethtool tx aggregation parameters:

ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
Maximum size of an aggregated block of frames in tx.

ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
Maximum number of frames that can be aggregated into a block.

ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
Time in usecs after the first packet arrival in an aggregated
block for the block to be sent.

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
---
 Documentation/networking/ethtool-netlink.rst |  6 ++++++
 include/linux/ethtool.h                      | 12 ++++++++++-
 include/uapi/linux/ethtool_netlink.h         |  3 +++
 net/ethtool/coalesce.c                       | 22 ++++++++++++++++++--
 4 files changed, 40 insertions(+), 3 deletions(-)

Comments

Jakub Kicinski Nov. 11, 2022, 5:07 p.m. UTC | #1
On Wed,  9 Nov 2022 19:02:47 +0100 Daniele Palmas wrote:
> Add the following ethtool tx aggregation parameters:
> 
> ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
> Maximum size of an aggregated block of frames in tx.

perhaps s/size/bytes/ ? Or just mention bytes in the doc? I think it's
the first argument in coalescing expressed in bytes, so to avoid
confusion we should state that clearly.

> ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
> Maximum number of frames that can be aggregated into a block.
> 
> ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
> Time in usecs after the first packet arrival in an aggregated
> block for the block to be sent.

Can we add this info to the ethtool-netlink.rst doc?

Can we also add a couple of sentences describing what aggregation is?
Something about copying the packets into a contiguous buffer to submit
as one large IO operation, usually found on USB adapters?

People with very different device needs will read this and may pattern
match the parameters to something completely different like just
delaying ringing the doorbell. So even if things seem obvious they are
worth documenting.

> diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> index d578b8bcd8a4..a6f115867648 100644
> --- a/Documentation/networking/ethtool-netlink.rst
> +++ b/Documentation/networking/ethtool-netlink.rst
> @@ -1001,6 +1001,9 @@ Kernel response contents:
>    ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
>    ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
>    ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
> +  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE``      u32     max aggr packets size, Tx
> +  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES``    u32     max aggr packets, Tx
> +  ``ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME``    u32     time (us), aggr pkts, Tx

nit: perhaps move _aggr before the specifics? e.g.

 ETHTOOL_A_COALESCE_TX_AGGR_MAX_SIZE
 ETHTOOL_A_COALESCE_TX_AGGR_USECS_TIME

FWIW I find that the easiest way to do whole-sale renames in a series
is to generate the patches as .patch files, run sed on those, and apply
them back on a fresh branch. Rebasing is a PITA with renames.

Other then these nit picks - looks very reasonable :)
Daniele Palmas Nov. 11, 2022, 9:51 p.m. UTC | #2
Hello Jakub,

Il giorno ven 11 nov 2022 alle ore 18:07 Jakub Kicinski
<kuba@kernel.org> ha scritto:
>
> On Wed,  9 Nov 2022 19:02:47 +0100 Daniele Palmas wrote:
> > Add the following ethtool tx aggregation parameters:
> >
> > ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
> > Maximum size of an aggregated block of frames in tx.
>
> perhaps s/size/bytes/ ? Or just mention bytes in the doc? I think it's
> the first argument in coalescing expressed in bytes, so to avoid
> confusion we should state that clearly.

Right, better to have the word bytes to make it clear.

>
> > ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
> > Maximum number of frames that can be aggregated into a block.
> >
> > ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
> > Time in usecs after the first packet arrival in an aggregated
> > block for the block to be sent.
>
> Can we add this info to the ethtool-netlink.rst doc?
>
> Can we also add a couple of sentences describing what aggregation is?
> Something about copying the packets into a contiguous buffer to submit
> as one large IO operation, usually found on USB adapters?
>
> People with very different device needs will read this and may pattern
> match the parameters to something completely different like just
> delaying ringing the doorbell. So even if things seem obvious they are
> worth documenting.
>

Sure, I'll take care of documenting this better.

> > diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
> > index d578b8bcd8a4..a6f115867648 100644
> > --- a/Documentation/networking/ethtool-netlink.rst
> > +++ b/Documentation/networking/ethtool-netlink.rst
> > @@ -1001,6 +1001,9 @@ Kernel response contents:
> >    ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
> >    ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
> >    ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
> > +  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE``      u32     max aggr packets size, Tx
> > +  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES``    u32     max aggr packets, Tx
> > +  ``ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME``    u32     time (us), aggr pkts, Tx
>
> nit: perhaps move _aggr before the specifics? e.g.
>
>  ETHTOOL_A_COALESCE_TX_AGGR_MAX_SIZE
>  ETHTOOL_A_COALESCE_TX_AGGR_USECS_TIME
>

Ack.

> FWIW I find that the easiest way to do whole-sale renames in a series
> is to generate the patches as .patch files, run sed on those, and apply
> them back on a fresh branch. Rebasing is a PITA with renames.
>

Thanks for the advice.

> Other then these nit picks - looks very reasonable :)

Thanks for the review!

Regards,
Daniele
Gal Pressman Nov. 13, 2022, 9:48 a.m. UTC | #3
On 11/11/2022 19:07, Jakub Kicinski wrote:
> On Wed,  9 Nov 2022 19:02:47 +0100 Daniele Palmas wrote:
>> Add the following ethtool tx aggregation parameters:
>>
>> ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
>> Maximum size of an aggregated block of frames in tx.
> perhaps s/size/bytes/ ? Or just mention bytes in the doc? I think it's
> the first argument in coalescing expressed in bytes, so to avoid
> confusion we should state that clearly.
>
>> ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
>> Maximum number of frames that can be aggregated into a block.
>>
>> ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
>> Time in usecs after the first packet arrival in an aggregated
>> block for the block to be sent.
> Can we add this info to the ethtool-netlink.rst doc?
>
> Can we also add a couple of sentences describing what aggregation is?
> Something about copying the packets into a contiguous buffer to submit
> as one large IO operation, usually found on USB adapters?
>
> People with very different device needs will read this and may pattern
> match the parameters to something completely different like just
> delaying ringing the doorbell. So even if things seem obvious they are
> worth documenting.

Seems like I am these people, I was under the impression this is kind of
a threshold for xmit more or something?
What is this contiguous buffer? Isn't this the same as TX copybreak? TX
copybreak for multiple packets?
Daniele Palmas Nov. 14, 2022, 10:06 a.m. UTC | #4
Hello Gal,

Il giorno dom 13 nov 2022 alle ore 10:48 Gal Pressman <gal@nvidia.com>
ha scritto:
>
> On 11/11/2022 19:07, Jakub Kicinski wrote:
> > On Wed,  9 Nov 2022 19:02:47 +0100 Daniele Palmas wrote:
> >> Add the following ethtool tx aggregation parameters:
> >>
> >> ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
> >> Maximum size of an aggregated block of frames in tx.
> > perhaps s/size/bytes/ ? Or just mention bytes in the doc? I think it's
> > the first argument in coalescing expressed in bytes, so to avoid
> > confusion we should state that clearly.
> >
> >> ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
> >> Maximum number of frames that can be aggregated into a block.
> >>
> >> ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
> >> Time in usecs after the first packet arrival in an aggregated
> >> block for the block to be sent.
> > Can we add this info to the ethtool-netlink.rst doc?
> >
> > Can we also add a couple of sentences describing what aggregation is?
> > Something about copying the packets into a contiguous buffer to submit
> > as one large IO operation, usually found on USB adapters?
> >
> > People with very different device needs will read this and may pattern
> > match the parameters to something completely different like just
> > delaying ringing the doorbell. So even if things seem obvious they are
> > worth documenting.
>
> Seems like I am these people, I was under the impression this is kind of
> a threshold for xmit more or something?
> What is this contiguous buffer?

I would like to explain the issue I'm trying to solve.

I'm using an USB modem that is driven by qmi_wwan which creates a
netdevice: on top of this the rmnet module creates another netdevice,
needed since packets sent to the modem needs to follow the qmap
protocol both for multiplexing and performance.

Without tx packets aggregation each tx packet sent to the rmnet
netdevice makes an URB to be sent through qmi_wwan/usbnet, so that
there is a 1:1 relation between a qmap packet and an URB.

So far, this has not been an issue, but I've run into a family of
modems for which this behavior does not work well, preventing the
modem from reaching the maximum throughput both in rx and tx (an
example of the issue is described at
https://lore.kernel.org/netdev/CAGRyCJEkTHpLVsD9zTzSQp8d98SBM24nyqq-HA0jbvHUre+C4g@mail.gmail.com/
)

Tx packets aggregation allows to overcome this issue, so that a single
URB holds N qmap packets, reducing URBs frequency.

The maximum number of allowed packets in a single URB and the maximum
size of the URB are dictated by the modem through the qmi control
protocol: the values returned by the modem are then configured in the
driver with the new ethtool parameters.

> Isn't this the same as TX copybreak? TX
> copybreak for multiple packets?

I tried looking at how tx copybreak works to understand your comment,
but I could not find any useful document. Probably my fault, but can
you please point me to something I can read?

Thanks,
Daniele
Dave Taht Nov. 14, 2022, 10:45 a.m. UTC | #5
On Mon, Nov 14, 2022 at 2:31 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> Hello Gal,
>
> Il giorno dom 13 nov 2022 alle ore 10:48 Gal Pressman <gal@nvidia.com>
> ha scritto:
> >
> > On 11/11/2022 19:07, Jakub Kicinski wrote:
> > > On Wed,  9 Nov 2022 19:02:47 +0100 Daniele Palmas wrote:
> > >> Add the following ethtool tx aggregation parameters:
> > >>
> > >> ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE
> > >> Maximum size of an aggregated block of frames in tx.
> > > perhaps s/size/bytes/ ? Or just mention bytes in the doc? I think it's
> > > the first argument in coalescing expressed in bytes, so to avoid
> > > confusion we should state that clearly.
> > >
> > >> ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES
> > >> Maximum number of frames that can be aggregated into a block.
> > >>
> > >> ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME
> > >> Time in usecs after the first packet arrival in an aggregated
> > >> block for the block to be sent.
> > > Can we add this info to the ethtool-netlink.rst doc?
> > >
> > > Can we also add a couple of sentences describing what aggregation is?
> > > Something about copying the packets into a contiguous buffer to submit
> > > as one large IO operation, usually found on USB adapters?
> > >
> > > People with very different device needs will read this and may pattern
> > > match the parameters to something completely different like just
> > > delaying ringing the doorbell. So even if things seem obvious they are
> > > worth documenting.
> >
> > Seems like I am these people, I was under the impression this is kind of
> > a threshold for xmit more or something?
> > What is this contiguous buffer?
>
> I would like to explain the issue I'm trying to solve.
>
> I'm using an USB modem that is driven by qmi_wwan which creates a
> netdevice: on top of this the rmnet module creates another netdevice,
> needed since packets sent to the modem needs to follow the qmap
> protocol both for multiplexing and performance.
>
> Without tx packets aggregation each tx packet sent to the rmnet
> netdevice makes an URB to be sent through qmi_wwan/usbnet, so that
> there is a 1:1 relation between a qmap packet and an URB.
>
> So far, this has not been an issue, but I've run into a family of
> modems for which this behavior does not work well, preventing the
> modem from reaching the maximum throughput both in rx and tx (an
> example of the issue is described at
> https://lore.kernel.org/netdev/CAGRyCJEkTHpLVsD9zTzSQp8d98SBM24nyqq-HA0jbvHUre+C4g@mail.gmail.com/
> )
>
> Tx packets aggregation allows to overcome this issue, so that a single
> URB holds N qmap packets, reducing URBs frequency.

While I understand the use case... it's generally been my hope we got
to a BQL-like mechanism for
4G and 5G that keeps the latency under control. Right now, that can be
really, really, really miserable -
measured in *seconds* - and adding in packet aggregation naively is
what messed up Wifi for the
past decade. Please lose 8 minutes of your life to this (hilarious)
explanation of why aggregation can be bad.

https://www.youtube.com/watch?v=Rb-UnHDw02o&t=1560s

So given a choice between being able to drive the modem at the maximum
rate in a testbed...
or having it behave well at all possible (and highly variable) egress
rates, I would so love for more to focus on the latter problem than
the former, at whatever levels and layers in the stack it takes.

As a test, what happens on the flent "rrul" test, before and after
this patch? Under good wireless conditions, and bad?

flent -H server -t my-test-conditions -x --socket-stats rrul
flent -H server -t my-test-conditions -x --socket-stats
--test-parameter=upload_streams=4 tcp_nup

I have servers for that all over the world
{de,london,fremont,dallas,singapore,toronto,}.starlink.taht.net

> The maximum number of allowed packets in a single URB and the maximum
> size of the URB are dictated by the modem through the qmi control
> protocol: the values returned by the modem are then configured in the
> driver with the new ethtool parameters.
>
> > Isn't this the same as TX copybreak? TX
> > copybreak for multiple packets?
>
> I tried looking at how tx copybreak works to understand your comment,
> but I could not find any useful document. Probably my fault, but can
> you please point me to something I can read?
>
> Thanks,
> Daniele
Jakub Kicinski Nov. 15, 2022, 12:42 a.m. UTC | #6
On Mon, 14 Nov 2022 11:06:19 +0100 Daniele Palmas wrote:
> > Isn't this the same as TX copybreak? TX
> > copybreak for multiple packets?  
> 
> I tried looking at how tx copybreak works to understand your comment,
> but I could not find any useful document. Probably my fault, but can
> you please point me to something I can read?

FWIW it's not exactly copy break, as it applies to all packets.
But there is indeed an extra copy.

Daniele's explanation is pretty solid, USB devices very often try 
to pack multiple packets into a single URB for better perf. IIUC Linux
drivers implement the feature on the Rx side, but fall short of doing
the same on the Tx side, because there's no API to control it.

I've seen the same thing in the WiFi USB devices I worked on in my
youth :)
Gal Pressman Nov. 15, 2022, 10:59 a.m. UTC | #7
On 15/11/2022 02:42, Jakub Kicinski wrote:
> On Mon, 14 Nov 2022 11:06:19 +0100 Daniele Palmas wrote:
>>> Isn't this the same as TX copybreak? TX
>>> copybreak for multiple packets?  
>> I tried looking at how tx copybreak works to understand your comment,
>> but I could not find any useful document. Probably my fault, but can
>> you please point me to something I can read?
> FWIW it's not exactly copy break, as it applies to all packets.
> But there is indeed an extra copy.
>
> Daniele's explanation is pretty solid, USB devices very often try 
> to pack multiple packets into a single URB for better perf. IIUC Linux
> drivers implement the feature on the Rx side, but fall short of doing
> the same on the Tx side, because there's no API to control it.
>
> I've seen the same thing in the WiFi USB devices I worked on in my
> youth :)

Thanks for the explanation.
How would this apply to a pci netdev driver?
Daniele Palmas Nov. 15, 2022, 11:51 a.m. UTC | #8
Hello Dave,

Il giorno lun 14 nov 2022 alle ore 11:46 Dave Taht
<dave.taht@gmail.com> ha scritto:
> > Tx packets aggregation allows to overcome this issue, so that a single
> > URB holds N qmap packets, reducing URBs frequency.
>
> While I understand the use case... it's generally been my hope we got
> to a BQL-like mechanism for
> 4G and 5G that keeps the latency under control. Right now, that can be
> really, really, really miserable -
> measured in *seconds* - and adding in packet aggregation naively is
> what messed up Wifi for the
> past decade. Please lose 8 minutes of your life to this (hilarious)
> explanation of why aggregation can be bad.
>
> https://www.youtube.com/watch?v=Rb-UnHDw02o&t=1560s
>

Nice video and really instructive :-)

> So given a choice between being able to drive the modem at the maximum
> rate in a testbed...
> or having it behave well at all possible (and highly variable) egress
> rates, I would so love for more to focus on the latter problem than
> the former, at whatever levels and layers in the stack it takes.
>

I get your point, but here it's not just a testbed issue, since I
think that the huge tx drop due to a concurrent rx can happen also in
real life scenarios.

Additionally, it seems that Qualcomm modems are meant to be used in
this way: as far as I know all QC downstream kernel versions have this
kind of feature in the rmnet code.

I think that this can be seen as adding one more choice for the user:
by default tx aggregation in rmnet would be disabled, so no one should
notice this change and suffer from latencies different than the ones
the current rmnet driver already has.

But for those that are affected by the same bug I'm facing or are
interested in a different use-case in which tx aggregation makes
sense, this feature can help.

Hope that this makes sense.

> As a test, what happens on the flent "rrul" test, before and after
> this patch? Under good wireless conditions, and bad?
>
> flent -H server -t my-test-conditions -x --socket-stats rrul
> flent -H server -t my-test-conditions -x --socket-stats
> --test-parameter=upload_streams=4 tcp_nup
>
> I have servers for that all over the world
> {de,london,fremont,dallas,singapore,toronto,}.starlink.taht.net
>

I've uploaded some results at
https://drive.google.com/drive/folders/1-HjhyJaN4oWRNv8P8C__KD9-V-IoBwbL?usp=sharing

The good network condition has been simulated through a callbox
connected to LAN (there are also a few pictures of the throughput on
the callbox side while performing the tests with tx aggregation
enabled/disabled).

Thanks,
Daniele

> > The maximum number of allowed packets in a single URB and the maximum
> > size of the URB are dictated by the modem through the qmi control
> > protocol: the values returned by the modem are then configured in the
> > driver with the new ethtool parameters.
> >
> > > Isn't this the same as TX copybreak? TX
> > > copybreak for multiple packets?
> >
> > I tried looking at how tx copybreak works to understand your comment,
> > but I could not find any useful document. Probably my fault, but can
> > you please point me to something I can read?
> >
> > Thanks,
> > Daniele
>
>
>
> --
> This song goes out to all the folk that thought Stadia would work:
> https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> Dave Täht CEO, TekLibre, LLC
Dave Taht Nov. 15, 2022, 3:27 p.m. UTC | #9
On Tue, Nov 15, 2022 at 3:57 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> Hello Dave,
>
> Il giorno lun 14 nov 2022 alle ore 11:46 Dave Taht
> <dave.taht@gmail.com> ha scritto:
> > > Tx packets aggregation allows to overcome this issue, so that a single
> > > URB holds N qmap packets, reducing URBs frequency.
> >
> > While I understand the use case... it's generally been my hope we got
> > to a BQL-like mechanism for
> > 4G and 5G that keeps the latency under control. Right now, that can be
> > really, really, really miserable -
> > measured in *seconds* - and adding in packet aggregation naively is
> > what messed up Wifi for the
> > past decade. Please lose 8 minutes of your life to this (hilarious)
> > explanation of why aggregation can be bad.
> >
> > https://www.youtube.com/watch?v=Rb-UnHDw02o&t=1560s
> >
>
> Nice video and really instructive :-)

Thank you. I wish more folk working on the wifi7 firmware had exposed
per station queues so they could have been gang scheduled by linux for
the RU facility.

I targeted a more recent video (with the help of the MIT juggling
club!) at a wider audience.

https://www.youtube.com/watch?v=TWViGcBlnm0&t=120s

> > So given a choice between being able to drive the modem at the maximum
> > rate in a testbed...
> > or having it behave well at all possible (and highly variable) egress
> > rates, I would so love for more to focus on the latter problem than
> > the former, at whatever levels and layers in the stack it takes.
> >
>
> I get your point, but here it's not just a testbed issue, since I
> think that the huge tx drop due to a concurrent rx can happen also in
> real life scenarios.
>
> Additionally, it seems that Qualcomm modems are meant to be used in
> this way: as far as I know all QC downstream kernel versions have this
> kind of feature in the rmnet code.

"feature" is not exactly the word I'd use...

>
> I think that this can be seen as adding one more choice for the user:
> by default tx aggregation in rmnet would be disabled, so no one should
> notice this change and suffer from latencies different than the ones
> the current rmnet driver already has.
>
> But for those that are affected by the same bug I'm facing or are
> interested in a different use-case in which tx aggregation makes
> sense, this feature can help.
>
> Hope that this makes sense.
>
> > As a test, what happens on the flent "rrul" test, before and after
> > this patch? Under good wireless conditions, and bad?
> >
> > flent -H server -t my-test-conditions -x --socket-stats rrul
> > flent -H server -t my-test-conditions -x --socket-stats
> > --test-parameter=upload_streams=4 tcp_nup
> >
> > I have servers for that all over the world
> > {de,london,fremont,dallas,singapore,toronto,}.starlink.taht.net
> >
>
> I've uploaded some results at
> https://drive.google.com/drive/folders/1-HjhyJaN4oWRNv8P8C__KD9-V-IoBwbL?usp=sharing

Thank you very much for performing those tests. Under bad conditions,
without tx aggregation, network delays grew
on the rrul test to over 15 seconds, and with tx aggregation, to 25
seconds. Since most of our protocols tend to break
at more than 1 second of induced latency, I do hope more can lean in
to find ways of resolving these problems.

Under good conditions, your result was almost reasonable - only 150ms
of induced delay on the rrul test, and aggregation worked a lot better
than non-aggregation. That said, over here are the results we
currently get out of some forms of fq_codel for wifi, at any rate,
only 8ms, and 40ms with others:
https://forum.openwrt.org/t/reducing-multiplexing-latencies-still-further-in-wifi/133605/42

Btw, you can plot these tests using the flent-gui various ways. I will
attach a couple in the hope they get through
this listserv.

We've had a team for two years now, trying to make 5G (and starlink)
behave better, leveraging active measurements
and cake. There are three very usable scripts here:
https://forum.openwrt.org/t/cake-w-adaptive-bandwidth/135379/2 that
while targetted at openwrt, also work on generic linux.

I'd love to know what happens for you on those. Something more elegant
would be nice.

> The good network condition has been simulated through a callbox
> connected to LAN (there are also a few pictures of the throughput on
> the callbox side while performing the tests with tx aggregation
> enabled/disabled).

It doesn't look like your software measures the same things flent does.

>
> Thanks,
> Daniele
>
> > > The maximum number of allowed packets in a single URB and the maximum
> > > size of the URB are dictated by the modem through the qmi control
> > > protocol: the values returned by the modem are then configured in the
> > > driver with the new ethtool parameters.
> > >
> > > > Isn't this the same as TX copybreak? TX
> > > > copybreak for multiple packets?
> > >
> > > I tried looking at how tx copybreak works to understand your comment,
> > > but I could not find any useful document. Probably my fault, but can
> > > you please point me to something I can read?
> > >
> > > Thanks,
> > > Daniele
> >
> >
> >
> > --
> > This song goes out to all the folk that thought Stadia would work:
> > https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
> > Dave Täht CEO, TekLibre, LLC
Jakub Kicinski Nov. 15, 2022, 4:21 p.m. UTC | #10
On Tue, 15 Nov 2022 12:59:47 +0200 Gal Pressman wrote:
> Thanks for the explanation.
> How would this apply to a pci netdev driver?

I don't think it would. I've only seen it on USB devices.
We should add a note to this effect to the documentation.
Perhaps other funny buses may use it but PCIe remains efficient 
for packet sizes below standard MTUs so unlikely we'd need this
sort of aggregation.
diff mbox series

Patch

diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index d578b8bcd8a4..a6f115867648 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -1001,6 +1001,9 @@  Kernel response contents:
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
   ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
   ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE``      u32     max aggr packets size, Tx
+  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME``    u32     time (us), aggr pkts, Tx
   ===========================================  ======  =======================
 
 Attributes are only included in reply if their value is not zero or the
@@ -1052,6 +1055,9 @@  Request contents:
   ``ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL``  u32     rate sampling interval
   ``ETHTOOL_A_COALESCE_USE_CQE_TX``            bool    timer reset mode, Tx
   ``ETHTOOL_A_COALESCE_USE_CQE_RX``            bool    timer reset mode, Rx
+  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE``      u32     max aggr packets size, Tx
+  ``ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES``    u32     max aggr packets, Tx
+  ``ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME``    u32     time (us), aggr pkts, Tx
   ===========================================  ======  =======================
 
 Request is rejected if it attributes declared as unsupported by driver (i.e.
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 99dc7bfbcd3c..3726db470247 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -203,6 +203,9 @@  __ethtool_get_link_ksettings(struct net_device *dev,
 struct kernel_ethtool_coalesce {
 	u8 use_cqe_mode_tx;
 	u8 use_cqe_mode_rx;
+	u32 tx_max_aggr_size;
+	u32 tx_max_aggr_frames;
+	u32 tx_usecs_aggr_time;
 };
 
 /**
@@ -246,7 +249,10 @@  bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 #define ETHTOOL_COALESCE_RATE_SAMPLE_INTERVAL	BIT(21)
 #define ETHTOOL_COALESCE_USE_CQE_RX		BIT(22)
 #define ETHTOOL_COALESCE_USE_CQE_TX		BIT(23)
-#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(23, 0)
+#define ETHTOOL_COALESCE_TX_MAX_AGGR_SIZE	BIT(24)
+#define ETHTOOL_COALESCE_TX_MAX_AGGR_FRAMES	BIT(25)
+#define ETHTOOL_COALESCE_TX_USECS_AGGR_TIME	BIT(26)
+#define ETHTOOL_COALESCE_ALL_PARAMS		GENMASK(26, 0)
 
 #define ETHTOOL_COALESCE_USECS						\
 	(ETHTOOL_COALESCE_RX_USECS | ETHTOOL_COALESCE_TX_USECS)
@@ -274,6 +280,10 @@  bool ethtool_convert_link_mode_to_legacy_u32(u32 *legacy_u32,
 	 ETHTOOL_COALESCE_RATE_SAMPLE_INTERVAL)
 #define ETHTOOL_COALESCE_USE_CQE					\
 	(ETHTOOL_COALESCE_USE_CQE_RX | ETHTOOL_COALESCE_USE_CQE_TX)
+#define ETHTOOL_COALESCE_TX_AGGR		\
+	(ETHTOOL_COALESCE_TX_MAX_AGGR_SIZE |	\
+	 ETHTOOL_COALESCE_TX_MAX_AGGR_FRAMES |	\
+	 ETHTOOL_COALESCE_TX_USECS_AGGR_TIME)
 
 #define ETHTOOL_STAT_NOT_SET	(~0ULL)
 
diff --git a/include/uapi/linux/ethtool_netlink.h b/include/uapi/linux/ethtool_netlink.h
index bb57084ac524..08872c8ea0d6 100644
--- a/include/uapi/linux/ethtool_netlink.h
+++ b/include/uapi/linux/ethtool_netlink.h
@@ -397,6 +397,9 @@  enum {
 	ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL,	/* u32 */
 	ETHTOOL_A_COALESCE_USE_CQE_MODE_TX,		/* u8 */
 	ETHTOOL_A_COALESCE_USE_CQE_MODE_RX,		/* u8 */
+	ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE,		/* u32 */
+	ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES,		/* u32 */
+	ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME,		/* u32 */
 
 	/* add new constants above here */
 	__ETHTOOL_A_COALESCE_CNT,
diff --git a/net/ethtool/coalesce.c b/net/ethtool/coalesce.c
index 487bdf345541..014a7a4f73f2 100644
--- a/net/ethtool/coalesce.c
+++ b/net/ethtool/coalesce.c
@@ -105,7 +105,10 @@  static int coalesce_reply_size(const struct ethnl_req_info *req_base,
 	       nla_total_size(sizeof(u32)) +	/* _TX_MAX_FRAMES_HIGH */
 	       nla_total_size(sizeof(u32)) +	/* _RATE_SAMPLE_INTERVAL */
 	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_TX */
-	       nla_total_size(sizeof(u8));	/* _USE_CQE_MODE_RX */
+	       nla_total_size(sizeof(u8)) +	/* _USE_CQE_MODE_RX */
+	       nla_total_size(sizeof(u32)) +	/* _TX_MAX_AGGR_SIZE */
+	       nla_total_size(sizeof(u32)) +	/* _TX_MAX_AGGR_FRAMES */
+	       nla_total_size(sizeof(u32));	/* _TX_USECS_AGGR_TIME */
 }
 
 static bool coalesce_put_u32(struct sk_buff *skb, u16 attr_type, u32 val,
@@ -180,7 +183,13 @@  static int coalesce_fill_reply(struct sk_buff *skb,
 	    coalesce_put_bool(skb, ETHTOOL_A_COALESCE_USE_CQE_MODE_TX,
 			      kcoal->use_cqe_mode_tx, supported) ||
 	    coalesce_put_bool(skb, ETHTOOL_A_COALESCE_USE_CQE_MODE_RX,
-			      kcoal->use_cqe_mode_rx, supported))
+			      kcoal->use_cqe_mode_rx, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE,
+			     kcoal->tx_max_aggr_size, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES,
+			     kcoal->tx_max_aggr_frames, supported) ||
+	    coalesce_put_u32(skb, ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME,
+			     kcoal->tx_usecs_aggr_time, supported))
 		return -EMSGSIZE;
 
 	return 0;
@@ -227,6 +236,9 @@  const struct nla_policy ethnl_coalesce_set_policy[] = {
 	[ETHTOOL_A_COALESCE_RATE_SAMPLE_INTERVAL] = { .type = NLA_U32 },
 	[ETHTOOL_A_COALESCE_USE_CQE_MODE_TX]	= NLA_POLICY_MAX(NLA_U8, 1),
 	[ETHTOOL_A_COALESCE_USE_CQE_MODE_RX]	= NLA_POLICY_MAX(NLA_U8, 1),
+	[ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES] = { .type = NLA_U32 },
+	[ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME] = { .type = NLA_U32 },
 };
 
 int ethnl_set_coalesce(struct sk_buff *skb, struct genl_info *info)
@@ -321,6 +333,12 @@  int ethnl_set_coalesce(struct sk_buff *skb, struct genl_info *info)
 			tb[ETHTOOL_A_COALESCE_USE_CQE_MODE_TX], &mod);
 	ethnl_update_u8(&kernel_coalesce.use_cqe_mode_rx,
 			tb[ETHTOOL_A_COALESCE_USE_CQE_MODE_RX], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_max_aggr_size,
+			 tb[ETHTOOL_A_COALESCE_TX_MAX_AGGR_SIZE], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_max_aggr_frames,
+			 tb[ETHTOOL_A_COALESCE_TX_MAX_AGGR_FRAMES], &mod);
+	ethnl_update_u32(&kernel_coalesce.tx_usecs_aggr_time,
+			 tb[ETHTOOL_A_COALESCE_TX_USECS_AGGR_TIME], &mod);
 	ret = 0;
 	if (!mod)
 		goto out_ops;