
[net-next,v2,0/3] add tx packets aggregation to ethtool and rmnet

Message ID 20221130124616.1500643-1-dnlplm@gmail.com

Message

Daniele Palmas Nov. 30, 2022, 12:46 p.m. UTC
Hello maintainers and all,

this patchset implements tx QMAP packet aggregation in rmnet and generic
ethtool support for configuring it.

Some low-cat ThreadX-based modems are not capable of properly reaching the maximum
allowed throughput in both tx and rx during a bidirectional test if tx packet
aggregation is not enabled.

I verified this problem with rmnet + qmi_wwan by using an MDM9207 Cat. 4 based modem
(50Mbps/150Mbps max throughput). What is actually happening is pictured at
https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view

Testing with iperf TCP, when the rx and tx flows are tested individually there's no
issue in tx and only minor issues in rx (not able to reach max throughput). When tx
and rx flows run concurrently, tx throughput drops sharply; rx throughput drops too,
but less severely.
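
For reference, the bidirectional case can be reproduced with something like
the following (just a sketch: the server address is a placeholder and the
exact iperf version and options I used may differ):

iperf3 -c <server> -t 60 --bidir    # concurrent tx and rx TCP flows (needs iperf3 >= 3.7)

With older iperf3 versions the same effect can be obtained with two server
instances on different ports and a second client run with -R.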

The same scenario with tx aggregation enabled is pictured at
https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
showing a regular graph.

This issue does not happen with high-cat modems (e.g. SDX20), or at least it
does not happen at the throughputs I'm currently able to test: maybe the same
could happen when moving close to the maximum rates supported by those modems.
Anyway, having tx aggregation enabled should not hurt.

The first attempt to solve this issue was in the qmi_wwan qmap implementation,
see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/

However, it turned out that rmnet was a better candidate for the implementation.

Moreover, Greg and Jakub also suggested using ethtool for the configuration:
I'm not sure I got their advice right, but this patchset also adds generic
ethtool support for tx aggregation.
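
As a sketch of the intended usage on the rmnet virtual device (rmnet0 is
just an example name here, and the exact coalesce option names depend on
the corresponding ethtool userspace support):

ethtool -C rmnet0 tx-aggr-max-bytes 16384 tx-aggr-max-frames 10 tx-aggr-time-usecs 1000
ethtool -c rmnet0    # read back the current coalesce/aggregation settings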

The patches have been tested mainly against an MDM9207-based modem through USB
and an SDX55-based one through PCI (MHI).

v2 should address the comments highlighted in the review: the implementation is
still in rmnet, due to Subash's request to keep tx aggregation there.

Thanks,
Daniele

Daniele Palmas (3):
  ethtool: add tx aggregation parameters
  net: qualcomm: rmnet: add tx packets aggregation
  net: qualcomm: rmnet: add ethtool support for configuring tx
    aggregation

 Documentation/networking/ethtool-netlink.rst  |  17 ++
 .../ethernet/qualcomm/rmnet/rmnet_config.c    |   5 +
 .../ethernet/qualcomm/rmnet/rmnet_config.h    |  20 ++
 .../ethernet/qualcomm/rmnet/rmnet_handlers.c  |  18 +-
 .../net/ethernet/qualcomm/rmnet/rmnet_map.h   |   6 +
 .../ethernet/qualcomm/rmnet/rmnet_map_data.c  | 191 ++++++++++++++++++
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.c   |  54 ++++-
 .../net/ethernet/qualcomm/rmnet/rmnet_vnd.h   |   1 +
 include/linux/ethtool.h                       |  12 +-
 include/uapi/linux/ethtool_netlink.h          |   3 +
 net/ethtool/coalesce.c                        |  22 +-
 11 files changed, 342 insertions(+), 7 deletions(-)

Comments

Dave Taht Nov. 30, 2022, 3:04 p.m. UTC | #1
On Wed, Nov 30, 2022 at 5:15 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> Hello maintainers and all,
>
> this patchset implements tx qmap packets aggregation in rmnet and generic
> ethtool support for that.
>
> Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> allowed throughput both in tx and rx during a bidirectional test if tx packets
> aggregation is not enabled.
>
> I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view

Thank you for documenting which device this is. Is it still showing
150ms of bufferbloat in good conditions,
and 25 seconds or so in bad?

> Testing with iperf TCP, when rx and tx flows are tested singularly there's no issue
> in tx and minor issues in rx (not able to reach max throughput). When there are concurrent
> tx and rx flows, tx throughput has an huge drop. rx a minor one, but still present.
>
> The same scenario with tx aggregation enabled is pictured at
> https://drive.google.com/file/d/1jcVIKNZD7K3lHtwKE5W02mpaloudYYih/view
> showing a regular graph.
>
> This issue does not happen with high-cat modems (e.g. SDX20), or at least it
> does not happen at the throughputs I'm able to test currently: maybe the same
> could happen when moving close to the maximum rates supported by those modems.
> Anyway, having the tx aggregation enabled should not hurt.
>
> The first attempt to solve this issue was in qmi_wwan qmap implementation,
> see the discussion at https://lore.kernel.org/netdev/20221019132503.6783-1-dnlplm@gmail.com/
>
> However, it turned out that rmnet was a better candidate for the implementation.
>
> Moreover, Greg and Jakub suggested also to use ethtool for the configuration:
> not sure if I got their advice right, but this patchset add also generic ethtool
> support for tx aggregation.
>
> The patches have been tested mainly against an MDM9207 based modem through USB
> and SDX55 through PCI (MHI).
>
> v2 should address the comments highlighted in the review: the implementation is
> still in rmnet, due to Subash's request of keeping tx aggregation there.
>
> Thanks,
> Daniele
>
> Daniele Palmas (3):
>   ethtool: add tx aggregation parameters
>   net: qualcomm: rmnet: add tx packets aggregation
>   net: qualcomm: rmnet: add ethtool support for configuring tx
>     aggregation
>
>  Documentation/networking/ethtool-netlink.rst  |  17 ++
>  .../ethernet/qualcomm/rmnet/rmnet_config.c    |   5 +
>  .../ethernet/qualcomm/rmnet/rmnet_config.h    |  20 ++
>  .../ethernet/qualcomm/rmnet/rmnet_handlers.c  |  18 +-
>  .../net/ethernet/qualcomm/rmnet/rmnet_map.h   |   6 +
>  .../ethernet/qualcomm/rmnet/rmnet_map_data.c  | 191 ++++++++++++++++++
>  .../net/ethernet/qualcomm/rmnet/rmnet_vnd.c   |  54 ++++-
>  .../net/ethernet/qualcomm/rmnet/rmnet_vnd.h   |   1 +
>  include/linux/ethtool.h                       |  12 +-
>  include/uapi/linux/ethtool_netlink.h          |   3 +
>  net/ethtool/coalesce.c                        |  22 +-
>  11 files changed, 342 insertions(+), 7 deletions(-)
>
> --
> 2.37.1
>
Daniele Palmas Dec. 1, 2022, 10:48 a.m. UTC | #2
Hello Dave,

On Wed, Nov 30, 2022 at 4:04 PM Dave Taht
<dave.taht@gmail.com> wrote:
>
> On Wed, Nov 30, 2022 at 5:15 AM Daniele Palmas <dnlplm@gmail.com> wrote:
> >
> > Hello maintainers and all,
> >
> > this patchset implements tx qmap packets aggregation in rmnet and generic
> > ethtool support for that.
> >
> > Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> > allowed throughput both in tx and rx during a bidirectional test if tx packets
> > aggregation is not enabled.
> >
> > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
>
> Thank you for documenting which device this is. Is it still handing in
> 150ms of bufferbloat in good conditions,
> and 25 seconds or so in bad?
>

New Flent test results available at
https://drive.google.com/drive/folders/1-rpeuM2Dg9rVdYCP0M84K4Ook5kcZTWc?usp=share_link

From what I can understand it seems a bit better, but I'm not
completely sure how much of that is directly related to the v2 changes.

Regards,
Daniele
Dave Taht Dec. 1, 2022, 3:22 p.m. UTC | #3
On Thu, Dec 1, 2022 at 2:55 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> Hello Dave,
>
> On Wed, Nov 30, 2022 at 4:04 PM Dave Taht
> <dave.taht@gmail.com> wrote:
> >
> > On Wed, Nov 30, 2022 at 5:15 AM Daniele Palmas <dnlplm@gmail.com> wrote:
> > >
> > > Hello maintainers and all,
> > >
> > > this patchset implements tx qmap packets aggregation in rmnet and generic
> > > ethtool support for that.
> > >
> > > Some low-cat Thread-x based modems are not capable of properly reaching the maximum
> > > allowed throughput both in tx and rx during a bidirectional test if tx packets
> > > aggregation is not enabled.
> > >
> > > I verified this problem with rmnet + qmi_wwan by using a MDM9207 Cat. 4 based modem
> > > (50Mbps/150Mbps max throughput). What is actually happening is pictured at
> > > https://drive.google.com/file/d/1gSbozrtd9h0X63i6vdkNpN68d-9sg8f9/view
> >
> > Thank you for documenting which device this is. Is it still handing in
> > 150ms of bufferbloat in good conditions,
> > and 25 seconds or so in bad?
> >
>
> New Flent test results available at
> https://drive.google.com/drive/folders/1-rpeuM2Dg9rVdYCP0M84K4Ook5kcZTWc?usp=share_link
>
> From what I can understand, it seems to me a bit better, but not
> completely sure how much is directly related to the changes of v2.

Anything that can shorten the round trips experienced by flows such as
yours and wedge more data packets in would be a good thing.

A switch to using the "BBR" congestion controller might be a vast
improvement over what I figure is your default of cubic. On both
server and client....

modprobe tcp_bbr
sysctl -w net.ipv4.tcp_congestion_control=bbr

And rerun your tests.
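
Something along these lines (just a sketch; the flent options and the
server host are placeholders rather than your exact invocation):

sysctl net.ipv4.tcp_congestion_control    # should now report bbr
flent rrul -p all_scaled -H <server> -t "bbr" -o rrul-bbr.png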

Over the years we've come up with multiple mechanisms for fixing this
on other network subsystems (BQL, AQL, TSQ, etc.), but something
that could track a "completion" - where we knew the packet had finally
got "in the air" and out of the device - would be best. It's kind of
hard to convince everyone to replace their congestion controller.

Any chance you could share these results with the maker of the test
tool you are primarily using? I'd like to think that they would
embrace adding some sort of simultaneous latency measurement to it,
and I would hope that that would help bring more minds to figuring out
how to solve the 25!! seconds worth of delay that can accumulate on
this path through the kernel.

https://github.com/Zoxc/crusader is a nice emerging tool that runs
on all OSes (it's written in Rust), does a network test more right, and is
far simpler than flent. I especially like its staggered-start
feature, as it would clearly show a second flow, trying to start
with this kind of latency already in the system, failing
miserably.

My comments on your patchset are not a blocker to being accepted!

If you want to get a grip on how much better things "could be", slap
an instance of cake bandwidth 40mbit on the up, and 80mbit on the down
on your "good" network setup, watch the latencies fall to nearly zero,
watch a second flow starting late grab its fair share of the
bandwidth almost immediately, and dream of the metaverse actually
working right...
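
For instance (a rough sketch, assuming wwan0 is the interface carrying
the traffic; the download side needs an ifb redirect, or shaping on the
far end):

tc qdisc replace dev wwan0 root cake bandwidth 40Mbit
modprobe ifb numifbs=1
ip link set ifb0 up
tc qdisc add dev wwan0 handle ffff: ingress
tc filter add dev wwan0 parent ffff: protocol all matchall \
    action mirred egress redirect dev ifb0
tc qdisc replace dev ifb0 root cake bandwidth 80Mbit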

> Regards,
> Daniele
Daniele Palmas Dec. 2, 2022, 4:56 p.m. UTC | #4
On Thu, Dec 1, 2022 at 4:22 PM Dave Taht
<dave.taht@gmail.com> wrote:
>
> A switch to using the "BBR" congestion controller might be a vast
> improvement over what I figure is your default of cubic. On both
> server and client....
>
> modprobe tcp_bbr
> sysctl -w net.ipv4.tcp_congestion_control=bbr
>
> And rerun your tests.
>

Done, results available at
https://drive.google.com/drive/folders/14CRs3ugyX_ANpUhMRJCH3ORy-jR3mbIS?usp=sharing

> Any chance you could share these results with the maker of the test
> tool you are primarily using?

Mmmhh... Not easy, since I don't have a direct point of contact.

> My comments on your patchset are not a blocker to being accepted!
>

Thanks, I must admit that I don't have much knowledge on this topic;
I'll try to educate myself.

Regards,
Daniele
Dave Taht Dec. 2, 2022, 5:27 p.m. UTC | #5
On Fri, Dec 2, 2022 at 9:03 AM Daniele Palmas <dnlplm@gmail.com> wrote:
>
> On Thu, Dec 1, 2022 at 4:22 PM Dave Taht
> <dave.taht@gmail.com> wrote:
> >
> > A switch to using the "BBR" congestion controller might be a vast
> > improvement over what I figure is your default of cubic. On both
> > server and client....
> >
> > modprobe tcp_bbr
> > sysctl -w net.ipv4.tcp_congestion_control=bbr
> >
> > And rerun your tests.
> >
>
> Done, results available at
> https://drive.google.com/drive/folders/14CRs3ugyX_ANpUhMRJCH3ORy-jR3mbIS?usp=sharing

Mildly better on the up with BBR, but it still cracks 24 seconds on the rrul
up + down test. BBR on the server you are using might also actually
work out (which of course involves convincing the world to switch to
BBR for cell transfers) - could you try BBR on the server as well?

I keep hoping that somewhere in the 4G or 5G USB stack there is some
way to tag packets and tell the OS that a given
packet had actually been sent out over the air. Getting the maximum
amount of packets held in the card, at any rate good or bad, down below
20ms worth of delay would be marvelous. The 5G folk keep boasting about
1ms MAC "latency" and never seem to have noticed how bad it can get if
you actually send real data over a USB bus to it.

> > Any chance you could share these results with the maker of the test
> > tool you are primarily using?
>
> Mmmhh... Not easy, since I don't have a direct point of contact.

Mind if I rant about this somewhere?

> > My comments on your patchset are not a blocker to being accepted!
> >
>
> Thanks, I must admit that I don't have much knowledge on this topic,
> will try to educate myself.

thx! While I have a lot of entertaining videos about this kind of
stuff, the starting point I recommend most is Van Jacobson's
IETF talk, with a fountain analogy:

video: https://archive.org/details/video1_20191129
slides: https://pollere.net/Pdfdocs/codel.ietf-tsv.jul12.pdf

The abstraction I keep hoping we find (somehow) - for LTE/4G/5G - to
me would look a lot like BQL:

https://lwn.net/Articles/469652/
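
For reference, BQL already exposes its per-queue byte accounting through
sysfs, which gives a feel for what a completion-driven limit looks like
(eth0 here is just an example; wwan/rmnet devices would need driver
support to report completions this way):

cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit      # current dynamic byte limit
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight   # bytes queued but not yet completed
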
>
> Regards,
> Daniele