
[RFC,net-next,0/5] Suspend IRQs during preferred busy poll

Message ID 20240812125717.413108-1-jdamato@fastly.com

Message

Joe Damato Aug. 12, 2024, 12:57 p.m. UTC
Greetings:

Martin Karsten (CC'd) and I have been collaborating on some ideas about
ways of reducing tail latency when using epoll-based busy poll and we'd
love to get feedback from the list on the code in this series. This is
the idea I mentioned at netdev conf, for those who were there. Barring
any major issues, we hope to submit this officially shortly after RFC.

The basic idea for suspending IRQs in this manner was described in an
earlier paper presented at Sigmetrics 2024 [1].

Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for
epoll_params") introduced the ability to enable or disable preferred
busy poll mode on a specific epoll context using an ioctl
(EPIOCSPARAMS).

This series extends preferred busy poll mode by adding a sysfs parameter,
irq_suspend_timeout, which, when used in combination with preferred busy
poll, suspends device IRQs for up to irq_suspend_timeout nanoseconds.

Important call outs:
  - Enabling per epoll-context preferred busy poll will now effectively
    lead to a nonblocking iteration through napi_busy_loop, even when
    busy_poll_usecs is 0. See patch 4.

  - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa:
    microchip: ksz9477: split half-duplex monitoring function"), but
    may need to be respun if/when commit b4988e3bd1f0 ("eventpoll:
    Annotate data-race of busy_poll_usecs"), picked up by the vfs folks,
    makes its way into net-next.

  - In the future, time permitting, I hope to enable support for
    napi_defer_hard_irqs, gro_flush_timeout (introduced in commit
    6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and
    irq_suspend_timeout (introduced in this series) on a per-NAPI basis
    (presumably via netdev-genl).

~ Description of the changes

The overall idea is that IRQ suspension is introduced via a sysfs
parameter which controls the maximum time that IRQs can be suspended.

Here's how it is intended to work (a minimal application-side sketch
follows this list):
  - An administrator sets the existing sysfs parameters for
    defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.

  - An administrator sets the new sysfs parameter irq_suspend_timeout
    to a larger value than gro_flush_timeout to enable IRQ suspension.

  - The user application issues the existing epoll ioctl to set the
    prefer_busy_poll flag on the epoll context.

  - The user application then calls epoll_wait to busy poll for network
    events, as it normally would.

  - If epoll_wait returns events to userland, IRQs are suspended for
    the duration of irq_suspend_timeout.

  - If epoll_wait finds no events and the thread is about to go to
    sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is
    resumed.
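
For concreteness, below is a minimal sketch of the application side of
this workflow. The struct epoll_params and EPIOCSPARAMS definitions
mirror the uapi additions from commit 18e2bf0edf4d and are repeated here
only to keep the sketch self-contained; the values and error handling
are illustrative rather than part of this series.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/epoll.h>
  #include <sys/ioctl.h>

  /* Mirrors the uapi definitions from <linux/eventpoll.h>. */
  struct epoll_params {
          uint32_t busy_poll_usecs;
          uint16_t busy_poll_budget;
          uint8_t prefer_busy_poll;
          uint8_t __pad;
  };

  #define EPOLL_IOC_TYPE 0x8A
  #define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)

  int main(void)
  {
          struct epoll_event events[64];
          struct epoll_params params;
          int epfd, n;

          epfd = epoll_create1(0);
          if (epfd < 0)
                  return 1;

          /* Request preferred busy poll; busy_poll_usecs = 0 still
           * yields one nonblocking pass through napi_busy_loop per
           * epoll_wait (see patch 4).
           */
          memset(&params, 0, sizeof(params));
          params.busy_poll_usecs = 0;
          params.busy_poll_budget = 64;
          params.prefer_busy_poll = 1;
          if (ioctl(epfd, EPIOCSPARAMS, &params))
                  perror("EPIOCSPARAMS");

          /* ... epoll_ctl(epfd, EPOLL_CTL_ADD, ...) the connection
           * sockets here; with NAPI-based traffic splitting they all
           * map to the same NAPI instance ...
           */

          for (;;) {
                  /* While this keeps returning events, IRQs for the
                   * polled NAPI stay suspended (bounded by
                   * irq_suspend_timeout); once it finds nothing and
                   * the thread sleeps, gro_flush_timeout deferral
                   * takes over again.
                   */
                  n = epoll_wait(epfd, events, 64, -1);
                  if (n <= 0)
                          break;
                  /* ... process n events ... */
          }
          return 0;
  }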

As long as epoll_wait is retrieving events, IRQs (and softirq
processing) for the NAPI being polled remain disabled. Unless IRQ
suspension is renewed by subsequent calls to epoll_wait, it
automatically ends when the irq_suspend_timeout timer expires.

When network traffic reduces, eventually a busy poll loop in the kernel
will retrieve no data. When this occurs, regular deferral using
gro_flush_timeout for the polled NAPI is immediately re-enabled. Regular
deferral is also immediately re-enabled when the epoll context is
destroyed.

~ Benchmark configs & descriptions

These changes were benchmarked with memcached [2] using the
benchmarking tool mutilate [3].

To facilitate benchmarking, a small patch [4] was applied to
memcached 1.6.29 (the latest memcached release as of this RFC) to allow
setting per-epoll context preferred busy poll and other settings
via environment variables.

Multiple scenarios were benchmarked as described below
and the scripts used for producing these results can be found on
github [5].

(note: all scenarios use NAPI-based traffic splitting via
SO_INCOMING_NAPI_ID by passing -N to memcached; a configuration sketch
for the suspendX scenarios follows the list below):

  - base: Other than NAPI-based traffic splitting, no other options are
    enabled.
  - busy:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to 200,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
      busy_poll_budget = 64, prefer_busy_poll = true)
  - deferX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000
  - suspendX:
    - set defer_hard_irqs to 100
    - set gro_flush_timeout to X,000
    - set irq_suspend_timeout to 20,000,000
    - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
      busy_poll_budget = 64, prefer_busy_poll = true)
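
As a concrete illustration of the suspendX scenarios above, the sketch
below writes the suspend20 values to the per-interface sysfs attributes
(the existing napi_defer_hard_irqs and gro_flush_timeout files, plus the
irq_suspend_timeout attribute added by patch 1; the exact path of the
new attribute is an assumption based on it being added via net-sysfs).
The epoll side is the same as in the workflow sketch earlier, with
busy_poll_usecs = 0.

  #include <stdio.h>

  /* Write a single value to a sysfs attribute. */
  static int write_sysfs(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%s\n", val);
          return fclose(f) ? -1 : 0;
  }

  /* suspend20 settings from the scenario list above; dev is the
   * interface under test (e.g. "eth0").
   */
  static int configure_suspend20(const char *dev)
  {
          char path[256];
          int err = 0;

          snprintf(path, sizeof(path),
                   "/sys/class/net/%s/napi_defer_hard_irqs", dev);
          err |= write_sysfs(path, "100");

          snprintf(path, sizeof(path),
                   "/sys/class/net/%s/gro_flush_timeout", dev);
          err |= write_sysfs(path, "20000");

          /* New attribute from patch 1 of this series; path assumed. */
          snprintf(path, sizeof(path),
                   "/sys/class/net/%s/irq_suspend_timeout", dev);
          err |= write_sysfs(path, "20000000");

          return err ? -1 : 0;
  }

  int main(void)
  {
          return configure_suspend20("eth0") ? 1 : 0;
  }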

~ Benchmark results

Tested on:

Single socket AMD EPYC 7662 64-Core Processor
Hyperthreading disabled
4 NUMA Zones (NPS=4)
16 CPUs per NUMA zone (64 cores total)
2 x dual-port 100Gbps Mellanox Technologies ConnectX-5 Ex EN NICs

The test machine is configured such that a single interface has 8 RX
queues. The queues' IRQs and memcached are pinned to CPUs that are
NUMA-local to the interface under test. memcached binds to the IPv4
address on the configured interface.

The NIC's interrupt coalescing configuration is left at boot-time
defaults.

The overall takeaway from the results below is that the new mechanism
(suspend20, see below) delivers lower 99th-percentile latency and higher
QPS in the MAX QPS case than the other configurations, and lower latency
in the lower-QPS cases at CPU usage comparable to the base case (and
lower than the busy case).

base
  load     qps  avglat  95%lat  99%lat     cpu
  200K  199982     109     225     385      30
  400K  400054     138     262     676      44
  600K  599968     165     396     737      64
  800K  800002     353    1136    2098      83
 1000K  964960    3202    5556    7003      98
   MAX  957274    4255    5526    6843     100

busy
  load     qps  avglat  95%lat  99%lat     cpu
  200K  199936     101     239     287      57
  400K  399795      81     230     302      83
  600K  599797      65     169     264      95
  800K  799789      67     145     221      99
 1000K 1000135      97     186     287     100
   MAX 1079228    3752    7481   12634      98

defer20
  load     qps  avglat  95%lat  99%lat     cpu
  200K  200052      60     130     156      28
  400K  399797      67     140     176      49
  600K  600049      94     189     483      68
  800K  800106     246     959    2201      88
 1000K  857377    4377    5674    5830     100
   MAX  974672    4162    5454    5815     100

defer200
  load     qps  avglat  95%lat  99%lat     cpu
  200K  200029     165     258     316      18
  400K  399978     183     280     340      32
  600K  599818     205     310     367      46
  800K  799869     265     439     829      73
 1000K  995961    2307    5163    7027      98
   MAX 1050680    3837    5020    5596     100

suspend20
  load     qps  avglat  95%lat  99%lat     cpu
  200K  199968      58     128     161      31
  400K  400191      61     135     175      51
  600K  599872      67     142     196      66
  800K  800050      78     153     220      82
 1000K  999638     101     194     292      91
   MAX 1144308    3596    3961    4155     100

suspend200
  load     qps  avglat  95%lat  99%lat     cpu
  200K  199973     149     251     313      20
  400K  399957     154     270     331      35
  600K  599878     157     284     351      51
  800K  800091     158     293     359      65
 1000K 1000399     173     311     393      85
   MAX 1128033    3636    4210    4381     100

Thanks,
Martin and Joe

[1]: https://doi.org/10.1145/3626780
[2]: https://github.com/memcached/memcached/blob/master/doc/napi_ids.txt
[3]: https://github.com/leverich/mutilate
[4]: https://raw.githubusercontent.com/martinkarsten/irqsuspend/main/patches/memcached.patch
[5]: https://github.com/martinkarsten/irqsuspend

Martin Karsten (5):
  net: Add sysfs parameter irq_suspend_timeout
  net: Suspend softirq when prefer_busy_poll is set
  net: Add control functions for irq suspension
  eventpoll: Trigger napi_busy_loop, if prefer_busy_poll is set
  eventpoll: Control irq suspension for prefer_busy_poll

 Documentation/networking/napi.rst |  3 ++
 fs/eventpoll.c                    | 26 +++++++++++++--
 include/linux/netdevice.h         |  2 ++
 include/net/busy_poll.h           |  3 ++
 net/core/dev.c                    | 55 +++++++++++++++++++++++++++----
 net/core/net-sysfs.c              | 18 ++++++++++
 6 files changed, 98 insertions(+), 9 deletions(-)

--
2.25.1

Comments

Stanislav Fomichev Aug. 12, 2024, 8:19 p.m. UTC | #1
On 08/12, Joe Damato wrote:
> Greetings:
> 
> Martin Karsten (CC'd) and I have been collaborating on some ideas about
> ways of reducing tail latency when using epoll-based busy poll and we'd
> love to get feedback from the list on the code in this series. This is
> the idea I mentioned at netdev conf, for those who were there. Barring
> any major issues, we hope to submit this officially shortly after RFC.
> 
> The basic idea for suspending IRQs in this manner was described in an
> earlier paper presented at Sigmetrics 2024 [1].

Let me explicitly call out the paper. Very nice analysis!

> Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for
> epoll_params") introduced the ability to enable or disable preferred
> busy poll mode on a specific epoll context using an ioctl
> (EPIOCSPARAMS).
> 
> This series extends preferred busy poll mode by adding a sysfs parameter,
> irq_suspend_timeout, which when used in combination with preferred busy
> poll suspends device IRQs up to irq_suspend_timeout nanoseconds.
> 
> Important call outs:
>   - Enabling per epoll-context preferred busy poll will now effectively
>     lead to a nonblocking iteration through napi_busy_loop, even when
>     busy_poll_usecs is 0. See patch 4.
> 
>   - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa:
>     microchip: ksz9477: split half-duplex monitoring function"),  but
>     may need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate
>     data-race of busy_poll_usecs") picked up by the vfs folks makes its way
>     into net-next.
> 
>   - In the future, time permitting, I hope to enable support for
>     napi_defer_hard_irqs, gro_flush_timeout (introduced in commit
>     6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and
>     irq_suspend_timeout (introduced in this series) on a per-NAPI basis
>     (presumably via netdev-genl).
> 
> ~ Description of the changes
> 
> The overall idea is that IRQ suspension is introduced via a sysfs
> parameter which controls the maximum time that IRQs can be suspended.
> 
> Here's how it is intended to work:
>   - An administrator sets the existing sysfs parameters for
>     defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> 
>   - An administrator sets the new sysfs parameter irq_suspend_timeout
>     to a larger value than gro-timeout to enable IRQ suspension.

Can you expand more on what's the problem with the existing gro_flush_timeout?
Is it defer_hard_irqs_count? Or do you want a separate timeout only for the
prefer_busy_poll case (and why)? Because looking at the first two patches,
you essentially replace all usages of gro_flush_timeout with a new variable,
and I don't see how it helps.

Maybe expand more on which code paths we are trying to improve? The existing
busy polling code is not super readable, so it would be nice to simplify
it a bit in the process (if possible) instead of adding one more tunable.

>   - The user application issues the existing epoll ioctl to set the
>     prefer_busy_poll flag on the epoll context.
> 
>   - The user application then calls epoll_wait to busy poll for network
>     events, as it normally would.
> 
>   - If epoll_wait returns events to userland, IRQ are suspended for the
>     duration of irq_suspend_timeout.
> 
>   - If epoll_wait finds no events and the thread is about to go to
>     sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is
>     resumed.
> 
> As long as epoll_wait is retrieving events, IRQs (and softirq
> processing) for the NAPI being polled remain disabled. Unless IRQ
> suspension is continued by subsequent calls to epoll_wait, it
> automatically times out after the irq_suspend_timeout timer expires.
> 
> When network traffic reduces, eventually a busy poll loop in the kernel
> will retrieve no data. When this occurs, regular deferral using
> gro_flush_timeout for the polled NAPI is immediately re-enabled. Regular
> deferral is also immediately re-enabled when the epoll context is
> destroyed.
> 
> ~ Benchmark configs & descriptions
> 
> These changes were benchmarked with memcached [2] using the
> benchmarking tool mutilate [3].
> 
> To facilitate benchmarking, a small patch [4] was applied to
> memcached 1.6.29 (the latest memcached release as of this RFC) to allow
> setting per-epoll context preferred busy poll and other settings
> via environment variables.
> 
> Multiple scenarios were benchmarked as described below
> and the scripts used for producing these results can be found on
> github [5].
> 
> (note: all scenarios use NAPI-based traffic splitting via SO_INCOMING_ID
> by passing -N to memcached):
> 
>   - base: Other than NAPI-based traffic splitting, no other options are
>     enabled.
>   - busy:
>     - set defer_hard_irqs to 100
>     - set gro_flush_timeout to 200,000
>     - enable busy poll via the existing ioctl (busy_poll_usecs = 64,
>       busy_poll_budget = 64, prefer_busy_poll = true)
>   - deferX:
>     - set defer_hard_irqs to 100
>     - set gro_flush_timeout to X,000

[..]

>   - suspendX:
>     - set defer_hard_irqs to 100
>     - set gro_flush_timeout to X,000
>     - set irq_suspend_timeout to 20,000,000
>     - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>       busy_poll_budget = 64, prefer_busy_poll = true)

What's the intention of `busy_poll_usecs = 0` here? Presumably we fall back
to the busy_poll sysctl value?
Martin Karsten Aug. 12, 2024, 9:46 p.m. UTC | #2
On 2024-08-12 16:19, Stanislav Fomichev wrote:
> On 08/12, Joe Damato wrote:
>> Greetings:
>>
>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>> ways of reducing tail latency when using epoll-based busy poll and we'd
>> love to get feedback from the list on the code in this series. This is
>> the idea I mentioned at netdev conf, for those who were there. Barring
>> any major issues, we hope to submit this officially shortly after RFC.
>>
>> The basic idea for suspending IRQs in this manner was described in an
>> earlier paper presented at Sigmetrics 2024 [1].
> 
> Let me explicitly call out the paper. Very nice analysis!

Thank you!

[snip]

>> Here's how it is intended to work:
>>    - An administrator sets the existing sysfs parameters for
>>      defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>
>>    - An administrator sets the new sysfs parameter irq_suspend_timeout
>>      to a larger value than gro-timeout to enable IRQ suspension.
> 
> Can you expand more on what's the problem with the existing gro_flush_timeout?
> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> perfer_busy_poll case(why?)? Because looking at the first two patches,
> you essentially replace all usages of gro_flush_timeout with a new variable
> and I don't see how it helps.

gro-flush-timeout (in combination with defer-hard-irqs) is the default
irq deferral mechanism and, as such, is always active when configured.
Its static periodic softirq processing leads to a situation where:

- A long gro-flush-timeout causes high latencies when load is 
sufficiently below capacity, or

- a short gro-flush-timeout causes overhead when softirq execution 
asynchronously competes with application processing at high load.

The shortcomings of this are documented (to some extent) by our
experiments: defer20 works well at low load but has problems at high
load, while defer200 has higher latency at low load.

irq-suspend-timeout is only active when an application uses 
prefer-busy-polling and in that case, produces a nice alternating 
pattern of application processing and networking processing (similar to 
what we describe in the paper). This then works well with both low and 
high load.

> Maybe expand more on what code paths are we trying to improve? Existing
> busy polling code is not super readable, so would be nice to simplify
> it a bit in the process (if possible) instead of adding one more tunable.

There are essentially three possible loops for network processing:

1) hardirq -> softirq -> napi poll; this is the baseline functionality

2) timer -> softirq -> napi poll; this is the deferred irq processing
scheme with the shortcomings described above

3) epoll -> busy-poll -> napi poll

If a system is configured for 1), not much can be done, as it is 
difficult to interject anything into this loop without adding state and 
side effects. This is what we tried for the paper, but it ended up being 
a hack.

If, however, the system is configured for irq deferral, Loops 2) and 3)
"wrestle" with each other for control. Injecting the larger
irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in
favour of Loop 3) and creates the nice pattern described above.

[snip]

>>    - suspendX:
>>      - set defer_hard_irqs to 100
>>      - set gro_flush_timeout to X,000
>>      - set irq_suspend_timeout to 20,000,000
>>      - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>        busy_poll_budget = 64, prefer_busy_poll = true)
> 
> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
> to busy_poll sysctl value?

Before this patch set, ep_poll only calls napi_busy_poll if busy_poll
(sysctl) or busy_poll_usecs is nonzero. However, this might lead to
busy-polling even when the application does not actually need or want
it. Only one iteration through the busy loop is needed to make the new
scheme work. Additional napi busy polling over and above that is
optional.

Thanks,
Martin
Stanislav Fomichev Aug. 12, 2024, 11:03 p.m. UTC | #3
On 08/12, Martin Karsten wrote:
> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > On 08/12, Joe Damato wrote:
> > > Greetings:
> > > 
> > > Martin Karsten (CC'd) and I have been collaborating on some ideas about
> > > ways of reducing tail latency when using epoll-based busy poll and we'd
> > > love to get feedback from the list on the code in this series. This is
> > > the idea I mentioned at netdev conf, for those who were there. Barring
> > > any major issues, we hope to submit this officially shortly after RFC.
> > > 
> > > The basic idea for suspending IRQs in this manner was described in an
> > > earlier paper presented at Sigmetrics 2024 [1].
> > 
> > Let me explicitly call out the paper. Very nice analysis!
> 
> Thank you!
> 
> [snip]
> 
> > > Here's how it is intended to work:
> > >    - An administrator sets the existing sysfs parameters for
> > >      defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> > > 
> > >    - An administrator sets the new sysfs parameter irq_suspend_timeout
> > >      to a larger value than gro-timeout to enable IRQ suspension.
> > 
> > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > perfer_busy_poll case(why?)? Because looking at the first two patches,
> > you essentially replace all usages of gro_flush_timeout with a new variable
> > and I don't see how it helps.
> 
> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
> deferral mechanism and as such, always active when configured. Its static
> periodic softirq processing leads to a situation where:
> 
> - A long gro-flush-timeout causes high latencies when load is sufficiently
> below capacity, or
> 
> - a short gro-flush-timeout causes overhead when softirq execution
> asynchronously competes with application processing at high load.
> 
> The shortcomings of this are documented (to some extent) by our experiments.
> See defer20 working well at low load, but having problems at high load,
> while defer200 having higher latency at low load.
> 
> irq-suspend-timeout is only active when an application uses
> prefer-busy-polling and in that case, produces a nice alternating pattern of
> application processing and networking processing (similar to what we
> describe in the paper). This then works well with both low and high load.

So you only want it for the prefer-busy-polling case, makes sense. I was
a bit confused by the difference between defer200 and suspend200,
but now I see that defer200 does not enable busypoll.

I'm assuming that if you enable busypoll in the defer200 case, the numbers
should be similar to suspend200 (ignoring the potential effect on
non-busypolling queues due to the higher gro_flush_timeout).

> > Maybe expand more on what code paths are we trying to improve? Existing
> > busy polling code is not super readable, so would be nice to simplify
> > it a bit in the process (if possible) instead of adding one more tunable.
> 
> There are essentially three possible loops for network processing:
> 
> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> 
> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
> with the shortcomings described above
> 
> 3) epoll -> busy-poll -> napi poll
> 
> If a system is configured for 1), not much can be done, as it is difficult
> to interject anything into this loop without adding state and side effects.
> This is what we tried for the paper, but it ended up being a hack.
> 
> If however the system is configured for irq deferral, Loops 2) and 3)
> "wrestle" with each other for control. Injecting the larger
> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> of Loop 3) and creates the nice pattern describe above.

And you hit (2) when the epoll goes to sleep and/or when the userspace
isn't fast enough to keep up with the timer, presumably? I wonder
if we need to use this opportunity and do a proper API, as Joe hints in
the cover letter. Something over netlink to say "I'm gonna busy-poll on
this queue / napi_id and with this timeout". And then we can essentially make
gro_flush_timeout per queue (and avoid
napi_resume_irqs/napi_suspend_irqs). The existing gro_flush_timeout feels
too hacky already :-(

> [snip]
> 
> > >    - suspendX:
> > >      - set defer_hard_irqs to 100
> > >      - set gro_flush_timeout to X,000
> > >      - set irq_suspend_timeout to 20,000,000
> > >      - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> > >        busy_poll_budget = 64, prefer_busy_poll = true)
> > 
> > What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
> > to busy_poll sysctl value?
> 
> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
> busy-polling even when the application does not actually need or want it.
> Only one iteration through the busy loop is needed to make the new scheme
> work. Additional napi busy polling over and above is optional.

Ack, thanks, was trying to understand why not stay with
busy_poll_usecs=64 for consistency. But I guess you were just
trying to show that patch 4/5 works.
Martin Karsten Aug. 13, 2024, 12:04 a.m. UTC | #4
On 2024-08-12 19:03, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>> On 08/12, Joe Damato wrote:
>>>> Greetings:
>>>>
>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>>>> ways of reducing tail latency when using epoll-based busy poll and we'd
>>>> love to get feedback from the list on the code in this series. This is
>>>> the idea I mentioned at netdev conf, for those who were there. Barring
>>>> any major issues, we hope to submit this officially shortly after RFC.
>>>>
>>>> The basic idea for suspending IRQs in this manner was described in an
>>>> earlier paper presented at Sigmetrics 2024 [1].
>>>
>>> Let me explicitly call out the paper. Very nice analysis!
>>
>> Thank you!
>>
>> [snip]
>>
>>>> Here's how it is intended to work:
>>>>     - An administrator sets the existing sysfs parameters for
>>>>       defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>
>>>>     - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>       to a larger value than gro-timeout to enable IRQ suspension.
>>>
>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
>>> perfer_busy_poll case(why?)? Because looking at the first two patches,
>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>> and I don't see how it helps.
>>
>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
>> deferral mechanism and as such, always active when configured. Its static
>> periodic softirq processing leads to a situation where:
>>
>> - A long gro-flush-timeout causes high latencies when load is sufficiently
>> below capacity, or
>>
>> - a short gro-flush-timeout causes overhead when softirq execution
>> asynchronously competes with application processing at high load.
>>
>> The shortcomings of this are documented (to some extent) by our experiments.
>> See defer20 working well at low load, but having problems at high load,
>> while defer200 having higher latency at low load.
>>
>> irq-suspend-timeout is only active when an application uses
>> prefer-busy-polling and in that case, produces a nice alternating pattern of
>> application processing and networking processing (similar to what we
>> describe in the paper). This then works well with both low and high load.
> 
> So you only want it for the prefer-busy-pollingc case, makes sense. I was
> a bit confused by the difference between defer200 and suspend200,
> but now I see that defer200 does not enable busypoll.
> 
> I'm assuming that if you enable busypool in defer200 case, the numbers
> should be similar to suspend200 (ignoring potentially affecting
> non-busypolling queues due to higher gro_flush_timeout).

defer200 + napi busy poll is essentially what we labelled "busy" and it 
does not perform as well, since it still suffers interference between 
application and softirq processing.

>>> Maybe expand more on what code paths are we trying to improve? Existing
>>> busy polling code is not super readable, so would be nice to simplify
>>> it a bit in the process (if possible) instead of adding one more tunable.
>>
>> There are essentially three possible loops for network processing:
>>
>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>
>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
>> with the shortcomings described above
>>
>> 3) epoll -> busy-poll -> napi poll
>>
>> If a system is configured for 1), not much can be done, as it is difficult
>> to interject anything into this loop without adding state and side effects.
>> This is what we tried for the paper, but it ended up being a hack.
>>
>> If however the system is configured for irq deferral, Loops 2) and 3)
>> "wrestle" with each other for control. Injecting the larger
>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>> of Loop 3) and creates the nice pattern describe above.
> 
> And you hit (2) when the epoll goes to sleep and/or when the userspace
> isn't fast enough to keep up with the timer, presumably? I wonder
> if need to use this opportunity and do proper API as Joe hints in the
> cover letter. Something over netlink to say "I'm gonna busy-poll on
> this queue / napi_id and with this timeout". And then we can essentially make
> gro_flush_timeout per queue (and avoid
> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> too hacky already :-(

If someone were to implement the necessary changes to make these
parameters per-napi, this would improve things further, but note that
the current proposal gives strong performance across a range of
workloads, which is otherwise difficult or impossible to achieve.

Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake 
of an individual queue or application to make sure that IRQ suspension 
is enabled/disabled right away when the state of the system changes from 
busy to idle and back.

>> [snip]
>>
>>>>     - suspendX:
>>>>       - set defer_hard_irqs to 100
>>>>       - set gro_flush_timeout to X,000
>>>>       - set irq_suspend_timeout to 20,000,000
>>>>       - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>>>         busy_poll_budget = 64, prefer_busy_poll = true)
>>>
>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
>>> to busy_poll sysctl value?
>>
>> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
>> busy-polling even when the application does not actually need or want it.
>> Only one iteration through the busy loop is needed to make the new scheme
>> work. Additional napi busy polling over and above is optional.
> 
> Ack, thanks, was trying to understand why not stay with
> busy_poll_usecs=64 for consistency. But I guess you were just
> trying to show that patch 4/5 works.

Right, and we would potentially be wasting CPU cycles by adding more 
busy-looping.

Thanks,
Martin
Stanislav Fomichev Aug. 13, 2024, 1:54 a.m. UTC | #5
On 08/12, Martin Karsten wrote:
> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> > > On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > > On 08/12, Joe Damato wrote:
> > > > > Greetings:
> > > > > 
> > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about
> > > > > ways of reducing tail latency when using epoll-based busy poll and we'd
> > > > > love to get feedback from the list on the code in this series. This is
> > > > > the idea I mentioned at netdev conf, for those who were there. Barring
> > > > > any major issues, we hope to submit this officially shortly after RFC.
> > > > > 
> > > > > The basic idea for suspending IRQs in this manner was described in an
> > > > > earlier paper presented at Sigmetrics 2024 [1].
> > > > 
> > > > Let me explicitly call out the paper. Very nice analysis!
> > > 
> > > Thank you!
> > > 
> > > [snip]
> > > 
> > > > > Here's how it is intended to work:
> > > > >     - An administrator sets the existing sysfs parameters for
> > > > >       defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> > > > > 
> > > > >     - An administrator sets the new sysfs parameter irq_suspend_timeout
> > > > >       to a larger value than gro-timeout to enable IRQ suspension.
> > > > 
> > > > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > > > perfer_busy_poll case(why?)? Because looking at the first two patches,
> > > > you essentially replace all usages of gro_flush_timeout with a new variable
> > > > and I don't see how it helps.
> > > 
> > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
> > > deferral mechanism and as such, always active when configured. Its static
> > > periodic softirq processing leads to a situation where:
> > > 
> > > - A long gro-flush-timeout causes high latencies when load is sufficiently
> > > below capacity, or
> > > 
> > > - a short gro-flush-timeout causes overhead when softirq execution
> > > asynchronously competes with application processing at high load.
> > > 
> > > The shortcomings of this are documented (to some extent) by our experiments.
> > > See defer20 working well at low load, but having problems at high load,
> > > while defer200 having higher latency at low load.
> > > 
> > > irq-suspend-timeout is only active when an application uses
> > > prefer-busy-polling and in that case, produces a nice alternating pattern of
> > > application processing and networking processing (similar to what we
> > > describe in the paper). This then works well with both low and high load.
> > 
> > So you only want it for the prefer-busy-pollingc case, makes sense. I was
> > a bit confused by the difference between defer200 and suspend200,
> > but now I see that defer200 does not enable busypoll.
> > 
> > I'm assuming that if you enable busypool in defer200 case, the numbers
> > should be similar to suspend200 (ignoring potentially affecting
> > non-busypolling queues due to higher gro_flush_timeout).
> 
> defer200 + napi busy poll is essentially what we labelled "busy" and it does
> not perform as well, since it still suffers interference between application
> and softirq processing.

With all your patches applied? Why? Userspace not keeping up?

> > > > Maybe expand more on what code paths are we trying to improve? Existing
> > > > busy polling code is not super readable, so would be nice to simplify
> > > > it a bit in the process (if possible) instead of adding one more tunable.
> > > 
> > > There are essentially three possible loops for network processing:
> > > 
> > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> > > 
> > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
> > > with the shortcomings described above
> > > 
> > > 3) epoll -> busy-poll -> napi poll
> > > 
> > > If a system is configured for 1), not much can be done, as it is difficult
> > > to interject anything into this loop without adding state and side effects.
> > > This is what we tried for the paper, but it ended up being a hack.
> > > 
> > > If however the system is configured for irq deferral, Loops 2) and 3)
> > > "wrestle" with each other for control. Injecting the larger
> > > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> > > of Loop 3) and creates the nice pattern describe above.
> > 
> > And you hit (2) when the epoll goes to sleep and/or when the userspace
> > isn't fast enough to keep up with the timer, presumably? I wonder
> > if need to use this opportunity and do proper API as Joe hints in the
> > cover letter. Something over netlink to say "I'm gonna busy-poll on
> > this queue / napi_id and with this timeout". And then we can essentially make
> > gro_flush_timeout per queue (and avoid
> > napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> > too hacky already :-(
> 
> If someone would implement the necessary changes to make these parameters
> per-napi, this would improve things further, but note that the current
> proposal gives strong performance across a range of workloads, which is
> otherwise difficult to impossible to achieve.

Let's see what other people have to say. But we tried to do a similar
setup at Google recently and getting all these parameters right
was not trivial. Joe's recent patch series to push some of these into
the epoll context is a step in the right direction. It would be nice to
have a more explicit interface to express busy polling preference for
the users vs chasing a bunch of global tunables and fighting against softirq
wakeups.

> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> an individual queue or application to make sure that IRQ suspension is
> enabled/disabled right away when the state of the system changes from busy
> to idle and back.

Can we not handle everything in napi_busy_loop? If we can mark some napi
contexts as "explicitly polled by userspace with a larger defer timeout",
we should be able to do better compared to the current NAPI_F_PREFER_BUSY_POLL,
which is more like "this particular napi_poll call is user busy polling".

> > > [snip]
> > > 
> > > > >     - suspendX:
> > > > >       - set defer_hard_irqs to 100
> > > > >       - set gro_flush_timeout to X,000
> > > > >       - set irq_suspend_timeout to 20,000,000
> > > > >       - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
> > > > >         busy_poll_budget = 64, prefer_busy_poll = true)
> > > > 
> > > > What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
> > > > to busy_poll sysctl value?
> > > 
> > > Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
> > > (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
> > > busy-polling even when the application does not actually need or want it.
> > > Only one iteration through the busy loop is needed to make the new scheme
> > > work. Additional napi busy polling over and above is optional.
> > 
> > Ack, thanks, was trying to understand why not stay with
> > busy_poll_usecs=64 for consistency. But I guess you were just
> > trying to show that patch 4/5 works.
> 
> Right, and we would potentially be wasting CPU cycles by adding more
> busy-looping.

Or potentially improving the latency more if you happen to get more packets
during the busy_poll_usecs duration? I'd imagine some applications might
prefer to 100% busy poll without ever going to sleep (that would probably
require getting rid of napi_id tracking in epoll, but that's a different story).
Martin Karsten Aug. 13, 2024, 2:35 a.m. UTC | #6
On 2024-08-12 21:54, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>> On 08/12, Martin Karsten wrote:
>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>> On 08/12, Joe Damato wrote:
>>>>>> Greetings:
>>>>>>
>>>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about
>>>>>> ways of reducing tail latency when using epoll-based busy poll and we'd
>>>>>> love to get feedback from the list on the code in this series. This is
>>>>>> the idea I mentioned at netdev conf, for those who were there. Barring
>>>>>> any major issues, we hope to submit this officially shortly after RFC.
>>>>>>
>>>>>> The basic idea for suspending IRQs in this manner was described in an
>>>>>> earlier paper presented at Sigmetrics 2024 [1].
>>>>>
>>>>> Let me explicitly call out the paper. Very nice analysis!
>>>>
>>>> Thank you!
>>>>
>>>> [snip]
>>>>
>>>>>> Here's how it is intended to work:
>>>>>>      - An administrator sets the existing sysfs parameters for
>>>>>>        defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>>>
>>>>>>      - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>>>        to a larger value than gro-timeout to enable IRQ suspension.
>>>>>
>>>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
>>>>> perfer_busy_poll case(why?)? Because looking at the first two patches,
>>>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>>>> and I don't see how it helps.
>>>>
>>>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
>>>> deferral mechanism and as such, always active when configured. Its static
>>>> periodic softirq processing leads to a situation where:
>>>>
>>>> - A long gro-flush-timeout causes high latencies when load is sufficiently
>>>> below capacity, or
>>>>
>>>> - a short gro-flush-timeout causes overhead when softirq execution
>>>> asynchronously competes with application processing at high load.
>>>>
>>>> The shortcomings of this are documented (to some extent) by our experiments.
>>>> See defer20 working well at low load, but having problems at high load,
>>>> while defer200 having higher latency at low load.
>>>>
>>>> irq-suspend-timeout is only active when an application uses
>>>> prefer-busy-polling and in that case, produces a nice alternating pattern of
>>>> application processing and networking processing (similar to what we
>>>> describe in the paper). This then works well with both low and high load.
>>>
>>> So you only want it for the prefer-busy-pollingc case, makes sense. I was
>>> a bit confused by the difference between defer200 and suspend200,
>>> but now I see that defer200 does not enable busypoll.
>>>
>>> I'm assuming that if you enable busypool in defer200 case, the numbers
>>> should be similar to suspend200 (ignoring potentially affecting
>>> non-busypolling queues due to higher gro_flush_timeout).
>>
>> defer200 + napi busy poll is essentially what we labelled "busy" and it does
>> not perform as well, since it still suffers interference between application
>> and softirq processing.
> 
> With all your patches applied? Why? Userspace not keeping up?

Note that our "busy" case does not utilize our patches.

As illustrated by our performance numbers, its performance is better
than the base case, but at the cost of higher cpu utilization, and it
is still not as good as suspend20.

Explanation (conjecture):

It boils down to having to set a particular static value for 
gro-flush-timeout that is then always active.

If busy-poll + application processing takes longer than this timeout, 
the next softirq runs while the application is still active, which 
causes interference.

Once a softirq runs, the irq-loop (Loop 2) takes control. When the
application thread comes back to epoll_wait, it already finds data, so
ep_poll does not run napi_busy_poll at all and the irq-loop stays in
control.

This continues until by chance the application finds no readily 
available data when calling epoll_wait and ep_poll runs another 
napi_busy_poll. Then the system switches back to busy-polling mode.

So essentially the system non-deterministically alternates between 
busy-polling and irq deferral. irq deferral determines the high-order 
tail latencies, but there is still enough interference to make a 
difference. It's not as bad as in the base case, but not as good as 
properly controlled irq suspension.

>>>>> Maybe expand more on what code paths are we trying to improve? Existing
>>>>> busy polling code is not super readable, so would be nice to simplify
>>>>> it a bit in the process (if possible) instead of adding one more tunable.
>>>>
>>>> There are essentially three possible loops for network processing:
>>>>
>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>>>
>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
>>>> with the shortcomings described above
>>>>
>>>> 3) epoll -> busy-poll -> napi poll
>>>>
>>>> If a system is configured for 1), not much can be done, as it is difficult
>>>> to interject anything into this loop without adding state and side effects.
>>>> This is what we tried for the paper, but it ended up being a hack.
>>>>
>>>> If however the system is configured for irq deferral, Loops 2) and 3)
>>>> "wrestle" with each other for control. Injecting the larger
>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>>>> of Loop 3) and creates the nice pattern describe above.
>>>
>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
>>> isn't fast enough to keep up with the timer, presumably? I wonder
>>> if need to use this opportunity and do proper API as Joe hints in the
>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
>>> this queue / napi_id and with this timeout". And then we can essentially make
>>> gro_flush_timeout per queue (and avoid
>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
>>> too hacky already :-(
>>
>> If someone would implement the necessary changes to make these parameters
>> per-napi, this would improve things further, but note that the current
>> proposal gives strong performance across a range of workloads, which is
>> otherwise difficult to impossible to achieve.
> 
> Let's see what other people have to say. But we tried to do a similar
> setup at Google recently and getting all these parameters right
> was not trivial. Joe's recent patch series to push some of these into
> epoll context are a step in the right direction. It would be nice to
> have more explicit interface to express busy poling preference for
> the users vs chasing a bunch of global tunables and fighting against softirq
> wakups.

One of the goals of this patch set is to reduce parameter tuning and 
make the parameter setting independent of workload dynamics, so it 
should make things easier. This is of course notwithstanding that 
per-napi settings would be even better.

If you are able to share more details of your previous experiments (here 
or off-list), I would be very interested.

>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>> an individual queue or application to make sure that IRQ suspension is
>> enabled/disabled right away when the state of the system changes from busy
>> to idle and back.
> 
> Can we not handle everything in napi_busy_loop? If we can mark some napi
> contexts as "explicitly polled by userspace with a larger defer timeout",
> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> which is more like "this particular napi_poll call is user busy polling".

Then either the application needs to be polling all the time (wasting 
cpu cycles) or latencies will be determined by the timeout.

Only when switching back and forth between polling and interrupts is it 
possible to get low latencies across a large spectrum of offered loads 
without burning cpu cycles at 100%.

>>>> [snip]
>>>>
>>>>>>      - suspendX:
>>>>>>        - set defer_hard_irqs to 100
>>>>>>        - set gro_flush_timeout to X,000
>>>>>>        - set irq_suspend_timeout to 20,000,000
>>>>>>        - enable busy poll via the existing ioctl (busy_poll_usecs = 0,
>>>>>>          busy_poll_budget = 64, prefer_busy_poll = true)
>>>>>
>>>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback
>>>>> to busy_poll sysctl value?
>>>>
>>>> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll
>>>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to
>>>> busy-polling even when the application does not actually need or want it.
>>>> Only one iteration through the busy loop is needed to make the new scheme
>>>> work. Additional napi busy polling over and above is optional.
>>>
>>> Ack, thanks, was trying to understand why not stay with
>>> busy_poll_usecs=64 for consistency. But I guess you were just
>>> trying to show that patch 4/5 works.
>>
>> Right, and we would potentially be wasting CPU cycles by adding more
>> busy-looping.
> 
> Or potentially improving the latency more if you happen to get more packets
> during busy_poll_usecs duration? I'd imagine some applications might
> prefer to 100% busy poll without ever going to sleep (that would probably
> require getting rid of napi_id tracking in epoll, but that's a different story).

Yes, one could do full application-to-napi busy polling. The performance 
would be slightly better than irq suspension, but it would be quite 
wasteful during low load. One premise for our work is that saving cycles 
is a meaningful objective.

Thanks,
Martin
Stanislav Fomichev Aug. 13, 2024, 4:07 a.m. UTC | #7
On 08/12, Martin Karsten wrote:
> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> > > On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > > On 08/12, Martin Karsten wrote:
> > > > > On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > > > > On 08/12, Joe Damato wrote:
> > > > > > > Greetings:
> > > > > > > 
> > > > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about
> > > > > > > ways of reducing tail latency when using epoll-based busy poll and we'd
> > > > > > > love to get feedback from the list on the code in this series. This is
> > > > > > > the idea I mentioned at netdev conf, for those who were there. Barring
> > > > > > > any major issues, we hope to submit this officially shortly after RFC.
> > > > > > > 
> > > > > > > The basic idea for suspending IRQs in this manner was described in an
> > > > > > > earlier paper presented at Sigmetrics 2024 [1].
> > > > > > 
> > > > > > Let me explicitly call out the paper. Very nice analysis!
> > > > > 
> > > > > Thank you!
> > > > > 
> > > > > [snip]
> > > > > 
> > > > > > > Here's how it is intended to work:
> > > > > > >      - An administrator sets the existing sysfs parameters for
> > > > > > >        defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> > > > > > > 
> > > > > > >      - An administrator sets the new sysfs parameter irq_suspend_timeout
> > > > > > >        to a larger value than gro-timeout to enable IRQ suspension.
> > > > > > 
> > > > > > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > > > > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > > > > > perfer_busy_poll case(why?)? Because looking at the first two patches,
> > > > > > you essentially replace all usages of gro_flush_timeout with a new variable
> > > > > > and I don't see how it helps.
> > > > > 
> > > > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq
> > > > > deferral mechanism and as such, always active when configured. Its static
> > > > > periodic softirq processing leads to a situation where:
> > > > > 
> > > > > - A long gro-flush-timeout causes high latencies when load is sufficiently
> > > > > below capacity, or
> > > > > 
> > > > > - a short gro-flush-timeout causes overhead when softirq execution
> > > > > asynchronously competes with application processing at high load.
> > > > > 
> > > > > The shortcomings of this are documented (to some extent) by our experiments.
> > > > > See defer20 working well at low load, but having problems at high load,
> > > > > while defer200 having higher latency at low load.
> > > > > 
> > > > > irq-suspend-timeout is only active when an application uses
> > > > > prefer-busy-polling and in that case, produces a nice alternating pattern of
> > > > > application processing and networking processing (similar to what we
> > > > > describe in the paper). This then works well with both low and high load.
> > > > 
> > > > So you only want it for the prefer-busy-pollingc case, makes sense. I was
> > > > a bit confused by the difference between defer200 and suspend200,
> > > > but now I see that defer200 does not enable busypoll.
> > > > 
> > > > I'm assuming that if you enable busypool in defer200 case, the numbers
> > > > should be similar to suspend200 (ignoring potentially affecting
> > > > non-busypolling queues due to higher gro_flush_timeout).
> > > 
> > > defer200 + napi busy poll is essentially what we labelled "busy" and it does
> > > not perform as well, since it still suffers interference between application
> > > and softirq processing.
> > 
> > With all your patches applied? Why? Userspace not keeping up?
> 
> Note our "busy" case does not utilize our patches.

Great, thanks for confirming, that makes sense!

> As illustrated by our performance numbers, its performance is better than
> the base case, but at the cost of higher cpu utilization and it's still not
> as good as suspend20.
> 
> Explanation (conjecture):
> 
> It boils down to having to set a particular static value for
> gro-flush-timeout that is then always active.
> 
> If busy-poll + application processing takes longer than this timeout, the
> next softirq runs while the application is still active, which causes
> interference.
> 
> Once a softirq runs, the irq-loop (Loop 2) takes control. When the
> application thread comes back to epoll_wait, it already finds data, thus
> ep_poll does not run napi_busy_poll at all, thus the irq-loop stays in
> control.
> 
> This continues until by chance the application finds no readily available
> data when calling epoll_wait and ep_poll runs another napi_busy_poll. Then
> the system switches back to busy-polling mode.
> 
> So essentially the system non-deterministically alternates between
> busy-polling and irq deferral. irq deferral determines the high-order tail
> latencies, but there is still enough interference to make a difference. It's
> not as bad as in the base case, but not as good as properly controlled irq
> suspension.
> 
> > > > > > Maybe expand more on what code paths are we trying to improve? Existing
> > > > > > busy polling code is not super readable, so would be nice to simplify
> > > > > > it a bit in the process (if possible) instead of adding one more tunable.
> > > > > 
> > > > > There are essentially three possible loops for network processing:
> > > > > 
> > > > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> > > > > 
> > > > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
> > > > > with the shortcomings described above
> > > > > 
> > > > > 3) epoll -> busy-poll -> napi poll
> > > > > 
> > > > > If a system is configured for 1), not much can be done, as it is difficult
> > > > > to interject anything into this loop without adding state and side effects.
> > > > > This is what we tried for the paper, but it ended up being a hack.
> > > > > 
> > > > > If however the system is configured for irq deferral, Loops 2) and 3)
> > > > > "wrestle" with each other for control. Injecting the larger
> > > > > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> > > > > of Loop 3) and creates the nice pattern describe above.
> > > > 
> > > > And you hit (2) when the epoll goes to sleep and/or when the userspace
> > > > isn't fast enough to keep up with the timer, presumably? I wonder
> > > > if need to use this opportunity and do proper API as Joe hints in the
> > > > cover letter. Something over netlink to say "I'm gonna busy-poll on
> > > > this queue / napi_id and with this timeout". And then we can essentially make
> > > > gro_flush_timeout per queue (and avoid
> > > > napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> > > > too hacky already :-(
> > > 
> > > If someone would implement the necessary changes to make these parameters
> > > per-napi, this would improve things further, but note that the current
> > > proposal gives strong performance across a range of workloads, which is
> > > otherwise difficult to impossible to achieve.
> > 
> > Let's see what other people have to say. But we tried to do a similar
> > setup at Google recently and getting all these parameters right
> > was not trivial. Joe's recent patch series to push some of these into
> > epoll context are a step in the right direction. It would be nice to
> > have more explicit interface to express busy poling preference for
> > the users vs chasing a bunch of global tunables and fighting against softirq
> > wakups.
> 
> One of the goals of this patch set is to reduce parameter tuning and make
> the parameter setting independent of workload dynamics, so it should make
> things easier. This is of course notwithstanding that per-napi settings
> would be even better.
> 
> If you are able to share more details of your previous experiments (here or
> off-list), I would be very interested.

We went through a similar exercise of trying to get the tail latencies down.
We started with SO_BUSY_POLL, then switched to the per-epoll variant (except
we went with a hard-coded napi_id argument instead of tracking) and tried to
get a workable set of budget/timeout/gro_flush. We were fine with burning all
the cpu capacity we had and no sleep at all, so we ended up having a bunch
of special cases in the epoll loop to avoid the sleep.

But we were trying to make a different model work (the one you mention in the
paper as well) where the userspace busy-pollers are just running napi_poll
on one cpu and the actual work is consumed by the userspace on a different cpu.
(we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
and another epoll fd with actual sockets for signaling).

This mode has a different set of challenges with socket lock, socket rx
queue and the backlog processing :-(

> > > Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> > > an individual queue or application to make sure that IRQ suspension is
> > > enabled/disabled right away when the state of the system changes from busy
> > > to idle and back.
> > 
> > Can we not handle everything in napi_busy_loop? If we can mark some napi
> > contexts as "explicitly polled by userspace with a larger defer timeout",
> > we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> > which is more like "this particular napi_poll call is user busy polling".
> 
> Then either the application needs to be polling all the time (wasting cpu
> cycles) or latencies will be determined by the timeout.
> 
> Only when switching back and forth between polling and interrupts is it
> possible to get low latencies across a large spectrum of offered loads
> without burning cpu cycles at 100%.

Ah, I see what you're saying, yes, you're right. In this case ignore my comment
about ep_suspend_napi_irqs/napi_resume_irqs.

Let's see how other people feel about per-dev irq_suspend_timeout. Properly
disabling napi during busy polling is super useful, but it would still
be nice to plumb irq_suspend_timeout via epoll context or have it set on
a per-napi basis imho.
Martin Karsten Aug. 13, 2024, 1:18 p.m. UTC | #8
On 2024-08-13 00:07, Stanislav Fomichev wrote:
> On 08/12, Martin Karsten wrote:
>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>> On 08/12, Martin Karsten wrote:
>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>> On 08/12, Martin Karsten wrote:
>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>> Greetings:

[snip]

>>>>>>> Maybe expand more on what code paths are we trying to improve? Existing
>>>>>>> busy polling code is not super readable, so would be nice to simplify
>>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
>>>>>>
>>>>>> There are essentially three possible loops for network processing:
>>>>>>
>>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
>>>>>>
>>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
>>>>>> with the shortcomings described above
>>>>>>
>>>>>> 3) epoll -> busy-poll -> napi poll
>>>>>>
>>>>>> If a system is configured for 1), not much can be done, as it is difficult
>>>>>> to interject anything into this loop without adding state and side effects.
>>>>>> This is what we tried for the paper, but it ended up being a hack.
>>>>>>
>>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
>>>>>> "wrestle" with each other for control. Injecting the larger
>>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
>>>>>> of Loop 3) and creates the nice pattern describe above.
>>>>>
>>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
>>>>> isn't fast enough to keep up with the timer, presumably? I wonder
>>>>> if need to use this opportunity and do proper API as Joe hints in the
>>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
>>>>> this queue / napi_id and with this timeout". And then we can essentially make
>>>>> gro_flush_timeout per queue (and avoid
>>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
>>>>> too hacky already :-(
>>>>
>>>> If someone would implement the necessary changes to make these parameters
>>>> per-napi, this would improve things further, but note that the current
>>>> proposal gives strong performance across a range of workloads, which is
>>>> otherwise difficult to impossible to achieve.
>>>
>>> Let's see what other people have to say. But we tried to do a similar
>>> setup at Google recently and getting all these parameters right
>>> was not trivial. Joe's recent patch series to push some of these into
>>> epoll context are a step in the right direction. It would be nice to
>>> have more explicit interface to express busy poling preference for
>>> the users vs chasing a bunch of global tunables and fighting against softirq
>>> wakups.
>>
>> One of the goals of this patch set is to reduce parameter tuning and make
>> the parameter setting independent of workload dynamics, so it should make
>> things easier. This is of course notwithstanding that per-napi settings
>> would be even better.
>>
>> If you are able to share more details of your previous experiments (here or
>> off-list), I would be very interested.
> 
> We went through a similar exercise of trying to get the tail latencies down.
> Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> we went with a hard-coded napi_id argument instead of tracking) and trying to
> get a workable set of budget/timeout/gro_flush. We were fine with burning all
> cpu capacity we had and no sleep at all, so we ended up having a bunch
> of special cases in epoll loop to avoid the sleep.
> 
> But we were trying to make a different model work (the one you mention in the
> paper as well) where the userspace busy-pollers are just running napi_poll
> on one cpu and the actual work is consumed by the userspace on a different cpu.
> (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> and another epoll fd with actual sockets for signaling).
> 
> This mode has a different set of challenges with socket lock, socket rx
> queue and the backlog processing :-(

I agree. That model has challenges and is extremely difficult to tune right.

>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>> an individual queue or application to make sure that IRQ suspension is
>>>> enabled/disabled right away when the state of the system changes from busy
>>>> to idle and back.
>>>
>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>> which is more like "this particular napi_poll call is user busy polling".
>>
>> Then either the application needs to be polling all the time (wasting cpu
>> cycles) or latencies will be determined by the timeout.
>>
>> Only when switching back and forth between polling and interrupts is it
>> possible to get low latencies across a large spectrum of offered loads
>> without burning cpu cycles at 100%.
> 
> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> about ep_suspend_napi_irqs/napi_resume_irqs.

Thanks for probing and double-checking everything! Feedback is important 
for us to properly document our proposal.

> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> disabling napi during busy polling is super useful, but it would still
> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> a per-napi basis imho.

Fingers crossed. I hope this patch will be accepted, because it has 
practical performance and efficiency benefits, and that this will 
further increase the motivation to re-design the entire irq 
defer(/suspend) infrastructure for per-napi settings.

Thanks,
Martin
Jakub Kicinski Aug. 14, 2024, 12:10 a.m. UTC | #9
On Mon, 12 Aug 2024 17:46:42 -0400 Martin Karsten wrote:
> >> Here's how it is intended to work:
> >>    - An administrator sets the existing sysfs parameters for
> >>      defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
> >>
> >>    - An administrator sets the new sysfs parameter irq_suspend_timeout
> >>      to a larger value than gro-timeout to enable IRQ suspension.  
> > 
> > Can you expand more on what's the problem with the existing gro_flush_timeout?
> > Is it defer_hard_irqs_count? Or you want a separate timeout only for the
> > perfer_busy_poll case(why?)? Because looking at the first two patches,
> > you essentially replace all usages of gro_flush_timeout with a new variable
> > and I don't see how it helps.  
> 
> gro-flush-timeout (in combination with defer-hard-irqs) is the default 
> irq deferral mechanism and as such, always active when configured. Its 
> static periodic softirq processing leads to a situation where:
> 
> - A long gro-flush-timeout causes high latencies when load is 
> sufficiently below capacity, or
> 
> - a short gro-flush-timeout causes overhead when softirq execution 
> asynchronously competes with application processing at high load.
> 
> The shortcomings of this are documented (to some extent) by our 
> experiments. See defer20 working well at low load, but having problems 
> at high load, while defer200 having higher latency at low load.
> 
> irq-suspend-timeout is only active when an application uses 
> prefer-busy-polling and in that case, produces a nice alternating 
> pattern of application processing and networking processing (similar to 
> what we describe in the paper). This then works well with both low and 
> high load.

What about NIC interrupt coalescing? defer_hard_irqs_count was supposed
to be used with NICs which either don't have IRQ coalescing or have a
broken implementation. The timeout of 200usec should be perfectly within
range of what NICs can support.

If the NIC IRQ coalescing works, instead of adding a new timeout value
we could add a new deferral control (replacing defer_hard_irqs_count)
which would always kick in after seeing prefer_busy_poll() but also
not kick in if the busy poll harvested 0 packets.

> > Maybe expand more on what code paths are we trying to improve? Existing
> > busy polling code is not super readable, so would be nice to simplify
> > it a bit in the process (if possible) instead of adding one more tunable.  
> 
> There are essentially three possible loops for network processing:
> 
> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> 
> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme 
> with the shortcomings described above
> 
> 3) epoll -> busy-poll -> napi poll
> 
> If a system is configured for 1), not much can be done, as it is 
> difficult to interject anything into this loop without adding state and 
> side effects. This is what we tried for the paper, but it ended up being 
> a hack.
> 
> If however the system is configured for irq deferral, Loops 2) and 3) 
> "wrestle" with each other for control. Injecting the larger 
> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in 
> favour of Loop 3) and creates the nice pattern describe above.
Martin Karsten Aug. 14, 2024, 1:14 a.m. UTC | #10
On 2024-08-13 20:10, Jakub Kicinski wrote:
> On Mon, 12 Aug 2024 17:46:42 -0400 Martin Karsten wrote:
>>>> Here's how it is intended to work:
>>>>     - An administrator sets the existing sysfs parameters for
>>>>       defer_hard_irqs and gro_flush_timeout to enable IRQ deferral.
>>>>
>>>>     - An administrator sets the new sysfs parameter irq_suspend_timeout
>>>>       to a larger value than gro-timeout to enable IRQ suspension.
>>>
>>> Can you expand more on what's the problem with the existing gro_flush_timeout?
>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the
>>> perfer_busy_poll case(why?)? Because looking at the first two patches,
>>> you essentially replace all usages of gro_flush_timeout with a new variable
>>> and I don't see how it helps.
>>
>> gro-flush-timeout (in combination with defer-hard-irqs) is the default
>> irq deferral mechanism and as such, always active when configured. Its
>> static periodic softirq processing leads to a situation where:
>>
>> - A long gro-flush-timeout causes high latencies when load is
>> sufficiently below capacity, or
>>
>> - a short gro-flush-timeout causes overhead when softirq execution
>> asynchronously competes with application processing at high load.
>>
>> The shortcomings of this are documented (to some extent) by our
>> experiments. See defer20 working well at low load, but having problems
>> at high load, while defer200 having higher latency at low load.
>>
>> irq-suspend-timeout is only active when an application uses
>> prefer-busy-polling and in that case, produces a nice alternating
>> pattern of application processing and networking processing (similar to
>> what we describe in the paper). This then works well with both low and
>> high load.
> 
> What about NIC interrupt coalescing. defer_hard_irqs_count was supposed
> to be used with NICs which either don't have IRQ coalescing or have a
> broken implementation. The timeout of 200usec should be perfectly within
> range of what NICs can support.
> 
> If the NIC IRQ coalescing works, instead of adding a new timeout value
> we could add a new deferral control (replacing defer_hard_irqs_count)
> which would always kick in after seeing prefer_busy_poll() but also
> not kick in if the busy poll harvested 0 packets.
Maybe I am missing something, but I believe this would have the same 
problem that we describe for gro-timeout + defer-irq. When busy poll 
does not harvest packets and the application thread is idle and goes to 
sleep, it would then take up to 200 us to get the next interrupt. This 
considerably increases tail latencies under low load.

In order to get low latencies under low load, the NIC timeout would have to 
be something like 20 us, but under high load the application thread will 
be busy for longer than 20 us and the interrupt (and softirq) will come 
too early and cause interference.

The fundamental problem is that one fixed timer cannot handle dynamic 
workloads, regardless of whether the timer is implemented in software or 
the NIC. However, the current software implementation of the timer makes 
it easy to add our mechanism that effectively switches between a short 
and a long timeout. I assume it would be more difficult/overhead to 
update the NIC timer all the time.

In other words, the complexity is always the same: A very long timeout 
is needed to suspend irqs during periods of successful busy polling and 
application processing. A short timeout is needed to receive the next 
packet(s) with low latency during idle periods.

It is tempting to think of the second timeout as 0 and in fact re-enable 
interrupts right away. We have tried it, but it leads to a lot of 
interrupts and corresponding inefficiencies, since a system below 
capacity frequently switches between busy and idle. Using a small 
timeout (20 us) for modest deferral and batching when idle is a lot more 
efficient.
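
In pseudo-code, the decision on the epoll_wait() return path then looks 
roughly like this (a simplified sketch, not the actual patch code; the 
helper names follow this thread and the napi argument type is illustrative):

	/* Sketch only: intended epoll_wait() exit behaviour with IRQ
	 * suspension. napi_suspend_irqs()/napi_resume_irqs() are the helpers
	 * discussed in this thread; the argument type is illustrative. */
	struct napi_struct;
	void napi_suspend_irqs(struct napi_struct *napi);
	void napi_resume_irqs(struct napi_struct *napi);

	static void epoll_exit_irq_decision(struct napi_struct *napi,
					    int nr_events)
	{
		if (nr_events > 0)
			/* events delivered to userspace: keep device IRQs
			 * masked for up to irq_suspend_timeout ns (the long
			 * timeout) */
			napi_suspend_irqs(napi);
		else
			/* poll came up empty, thread is about to block: fall
			 * back to defer_hard_irqs/gro_flush_timeout (the
			 * short timeout) before IRQs are re-enabled */
			napi_resume_irqs(napi);
	}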

Thanks,
Martin
Willem de Bruijn Aug. 14, 2024, 3:16 a.m. UTC | #11
Martin Karsten wrote:
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
> 
> [snip]
> 
> >>>>>>> Maybe expand more on what code paths are we trying to improve? Existing
> >>>>>>> busy polling code is not super readable, so would be nice to simplify
> >>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
> >>>>>> with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is difficult
> >>>>>> to interject anything into this loop without adding state and side effects.
> >>>>>> This is what we tried for the paper, but it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
> >>>>>> "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> >>>>>> of Loop 3) and creates the nice pattern describe above.
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
> >>>>> isn't fast enough to keep up with the timer, presumably? I wonder
> >>>>> if need to use this opportunity and do proper API as Joe hints in the
> >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
> >>>>> this queue / napi_id and with this timeout". And then we can essentially make
> >>>>> gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> >>>>> too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these parameters
> >>>> per-napi, this would improve things further, but note that the current
> >>>> proposal gives strong performance across a range of workloads, which is
> >>>> otherwise difficult to impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right
> >>> was not trivial. Joe's recent patch series to push some of these into
> >>> epoll context are a step in the right direction. It would be nice to
> >>> have more explicit interface to express busy poling preference for
> >>> the users vs chasing a bunch of global tunables and fighting against softirq
> >>> wakups.
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and make
> >> the parameter setting independent of workload dynamics, so it should make
> >> things easier. This is of course notwithstanding that per-napi settings
> >> would be even better.

I don't follow how adding another tunable reduces parameter tuning.

> >>
> >> If you are able to share more details of your previous experiments (here or
> >> off-list), I would be very interested.
> > 
> > We went through a similar exercise of trying to get the tail latencies down.
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> > we went with a hard-coded napi_id argument instead of tracking) and trying to
> > get a workable set of budget/timeout/gro_flush. We were fine with burning all
> > cpu capacity we had and no sleep at all, so we ended up having a bunch
> > of special cases in epoll loop to avoid the sleep.
> > 
> > But we were trying to make a different model work (the one you mention in the
> > paper as well) where the userspace busy-pollers are just running napi_poll
> > on one cpu and the actual work is consumed by the userspace on a different cpu.
> > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> > and another epoll fd with actual sockets for signaling).
> > 
> > This mode has a different set of challenges with socket lock, socket rx
> > queue and the backlog processing :-(
> 
> I agree. That model has challenges and is extremely difficult to tune right.
> 
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>> an individual queue or application to make sure that IRQ suspension is
> >>>> enabled/disabled right away when the state of the system changes from busy
> >>>> to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> >>> which is more like "this particular napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting cpu
> >> cycles) or latencies will be determined by the timeout.
> >>
> >> Only when switching back and forth between polling and interrupts is it
> >> possible to get low latencies across a large spectrum of offered loads
> >> without burning cpu cycles at 100%.
> > 
> > Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > about ep_suspend_napi_irqs/napi_resume_irqs.
> 
> Thanks for probing and double-checking everything! Feedback is important 
> for us to properly document our proposal.
> 
> > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > disabling napi during busy polling is super useful, but it would still
> > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > a per-napi basis imho.
> 
> Fingers crossed. I hope this patch will be accepted, because it has 
> practical performance and efficiency benefits, and that this will 
> further increase the motivation to re-design the entire irq 
> defer(/suspend) infrastructure for per-napi settings.

Overall, the idea of keeping interrupts disabled during event
processing is very interesting.

Hopefully the interface can be made more intuitive. Or documented more
easily. I had to read the kernel patches to fully (perhaps) grasp it.

Another +1 on the referenced paper. It points out a specific difference
in behavior that is unrelated to the protection domain, rather than making
a straightforward kernel vs user argument. The paper also had some
explanation that may be clearer for a commit message than the current
cover letter:

"user-level network stacks put the application in charge of the entire
network stack processing (cf. Section 2). Interrupts are disabled and
the application coordinates execution by alternating between
processing existing requests and polling the RX queues for new data"
" [This series extends this behavior to kernel busy polling, while
falling back onto interrupt processing to limit CPU overhead.]

"Instead of re-enabling the respective interrupt(s) as soon as
epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
masked until a subsequent epoll_wait() call comes up empty, i.e., no
events of interest are found and the application thread is about to be
blocked."

"A fallback technical approach would use a kernel timeout set on the
return path from epoll_wait(). If necessary, the timeout re-enables
interrupts regardless of the application’s (mis)behaviour."
[Where misbehavior is not calling epoll_wait again]

"The resulting execution model mimics the execution model of typical
user-level network stacks and does not add any requirements compared
to user-level networking. In fact, it is slightly better, because it
can resort to blocking and interrupt delivery, instead of having to
continuously busyloop during idle times."

This last part shows a preference on your part for a trade-off:
you want low latency, but also low cpu utilization if possible.
This also came up in this thread. Please state that design decision
explicitly.

There are plenty of workloads where burning a core is acceptable
(especially as core count continues increasing), not "slightly worse".
Kernel polling with full busy polling is also already possible, by
choosing a very high napi_defer_hard_irqs and gro_flush_timeout. So
high in fact, that these tunables need not be tuned carefully. So
what this series adds is not interrupt suppression during event
processing per se, but doing so in a hybrid mode balancing latency
and cpu load.
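
For concreteness, the two configurations contrasted above could be set up
roughly like this from a small helper (a sketch only: napi_defer_hard_irqs
and gro_flush_timeout are the existing per-device sysfs attributes, the
irq_suspend_timeout path is assumed to follow the same per-device pattern
in this series, and the numbers are illustrative, not recommendations):

	#include <stdio.h>

	static int write_attr(const char *dev, const char *attr,
			      const char *val)
	{
		char path[256];
		FILE *f;

		/* e.g. /sys/class/net/eth0/gro_flush_timeout */
		snprintf(path, sizeof(path), "/sys/class/net/%s/%s", dev, attr);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%s\n", val);
		return fclose(f);
	}

	/* "full busy polling": defer IRQs (nearly) forever, the application
	 * must keep polling; exact values barely matter if they are huge */
	void config_full_busy_poll(const char *dev)
	{
		write_attr(dev, "napi_defer_hard_irqs", "1000000");
		write_attr(dev, "gro_flush_timeout", "10000000000"); /* 10 s */
	}

	/* hybrid suspend mode: short regular deferral, long suspend window */
	void config_irq_suspend(const char *dev)
	{
		write_attr(dev, "napi_defer_hard_irqs", "100");
		write_attr(dev, "gro_flush_timeout", "20000");       /* 20 us */
		write_attr(dev, "irq_suspend_timeout", "20000000");  /* 20 ms */
	}

	int main(void)
	{
		config_irq_suspend("eth0"); /* or config_full_busy_poll() */
		return 0;
	}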
Joe Damato Aug. 14, 2024, 2:19 p.m. UTC | #12
On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> Martin Karsten wrote:
> > On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > On 08/12, Martin Karsten wrote:
> > >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > >>> On 08/12, Martin Karsten wrote:
> > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > >>>>> On 08/12, Martin Karsten wrote:
> > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > >>>>>>> On 08/12, Joe Damato wrote:

[...]

> > >>
> > >> One of the goals of this patch set is to reduce parameter tuning and make
> > >> the parameter setting independent of workload dynamics, so it should make
> > >> things easier. This is of course notwithstanding that per-napi settings
> > >> would be even better.
> 
> I don't follow how adding another tunable reduces parameter tuning.

Thanks for taking a look and providing valuable feedback, Willem.

An early draft of the cover letter included some paragraphs which were removed
for the sake of brevity and that we can add back in; they address your comment
above:

 The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout)
 is useful, but picking the correct values for these settings is difficult and
 the ideal values change as the type of traffic and workload changes.

 For example: picking a large timeout value is good for batching packet
 processing, but induces latency. Picking a small timeout value will interrupt
 user app processing and reduce efficiency. The value chosen would be different
 depending on whether the system is under high load (large timeout) or when less
 busy (small timeout).

As such, adding the new tunable makes it much easier to use the existing ones
and also produces better performance as shown in the results we presented. 

Please let me know if you have any questions; I know that the change we are
introducing is very subtle and I am happy to expand the cover letter if it'd be
helpful for you.

My concern was that the cover letter was too long already, but a big takeaway
for me thus far has been that we should expand the cover letter.

[...]

> > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > disabling napi during busy polling is super useful, but it would still
> > > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > a per-napi basis imho.
> > 
> > Fingers crossed. I hope this patch will be accepted, because it has 
> > practical performance and efficiency benefits, and that this will 
> > further increase the motivation to re-design the entire irq 
> > defer(/suspend) infrastructure for per-napi settings.
> 
> Overall, the idea of keeping interrupts disabled during event
> processing is very interesting.

Thanks; I'm happy to hear we are aligned on this.

> Hopefully the interface can be made more intuitive. Or documented more
> easily. I had to read the kernel patches to fully (perhaps) grasp it.
> 
> Another +1 on the referenced paper. Pointing out a specific difference
> in behavior that is unrelated to the protection domain, rather than a
> straightforward kernel vs user argument. The paper also had some
> explanation that may be clearer for a commit message than the current
> cover letter:
> 
> "user-level network stacks put the application in charge of the entire
> network stack processing (cf. Section 2). Interrupts are disabled and
> the application coordinates execution by alternating between
> processing existing requests and polling the RX queues for new data"
> " [This series extends this behavior to kernel busy polling, while
> falling back onto interrupt processing to limit CPU overhead.]
> 
> "Instead of re-enabling the respective interrupt(s) as soon as
> epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
> masked until a subsequent epoll_wait() call comes up empty, i.e., no
> events of interest are found and the application thread is about to be
> blocked."
> 
> "A fallback technical approach would use a kernel timeout set on the
> return path from epoll_wait(). If necessary, the timeout re-enables
> > interrupts regardless of the application's (mis)behaviour."
> [Where misbehavior is not calling epoll_wait again]
> 
> "The resulting execution model mimics the execution model of typical
> user-level network stacks and does not add any requirements compared
> to user-level networking. In fact, it is slightly better, because it
> can resort to blocking and interrupt delivery, instead of having to
> continuously busyloop during idle times."
>
> This last part shows a preference on your part to a trade-off:
> you want low latency, but also low cpu utilization if possible.
> This also came up in this thread. Please state that design decision
> explicitly.

Sure, we can include that in the list of cover letter updates we
need to make. I could have called it out more clearly, but in the cover
letter [1] I mentioned that latency improved for comparable CPU usage (i.e.
CPU efficiency improved):

  The overall takeaway from the results below is that the new mechanism
  (suspend20, see below) results in reduced 99th percentile latency and
  increased QPS in the MAX QPS case (compared to the other cases), and
  reduced latency in the lower QPS cases for comparable CPU usage to the
  base case (and less CPU than the busy case).

> There are plenty of workloads where burning a core is acceptable
> (especially as core count continues increasing), not "slightly worse".

Respectfully, I don't think I'm on board with this argument. "Burning a core"
has side effects even as core counts increase (power, cooling, etc) and it
seems the counter argument is equally valid, as well: there are plenty of
workloads where burning a core is undesirable.

Using less CPU to get comparable performance is strictly better, even if a
system can theoretically support the increased CPU/power/cooling load.

Either way: this is not an either/or. Adding support for the code we've
proposed will be very beneficial for an important set of workloads without
taking anything away.

- Joe

[1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
Willem de Bruijn Aug. 14, 2024, 3:08 p.m. UTC | #13
Joe Damato wrote:
> On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> > Martin Karsten wrote:
> > > On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > > On 08/12, Martin Karsten wrote:
> > > >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > > >>> On 08/12, Martin Karsten wrote:
> > > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > >>>>> On 08/12, Martin Karsten wrote:
> > > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > >>>>>>> On 08/12, Joe Damato wrote:
> 
> [...]
> 
> > > >>
> > > >> One of the goals of this patch set is to reduce parameter tuning and make
> > > >> the parameter setting independent of workload dynamics, so it should make
> > > >> things easier. This is of course notwithstanding that per-napi settings
> > > >> would be even better.
> > 
> > I don't follow how adding another tunable reduces parameter tuning.
> 
> Thanks for taking a look and providing valuable feedback, Willem.
> 
> An early draft of the cover letter included some paragraphs which were removed
> for the sake of brevity that we can add back in, which addresses your comment
> above:
> 
>  The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout)
>  is useful, but picking the correct values for these settings is difficult and
>  the ideal values change as the type of traffic and workload changes.
> 
>  For example: picking a large timeout value is good for batching packet
>  processing, but induces latency. Picking a small timeout value will interrupt
>  user app processing and reduce efficiency. The value chosen would be different
>  depending on whether the system is under high load (large timeout) or when less
>  busy (small timeout).
> 
> As such, adding the new tunable makes it much easier to use the existing ones
> and also produces better performance as shown in the results we presented. 
> 
> Please let me know if you have any questions; I know that the change we are
> introducing is very subtle and I am happy to expand the cover letter if it'd be
> helpful for you.
> 
> My concern was that the cover letter was too long already, but a big takeaway
> for me thus far has been that we should expand the cover letter.
> 
> [...]
> 
> > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > > disabling napi during busy polling is super useful, but it would still
> > > > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > > a per-napi basis imho.
> > > 
> > > Fingers crossed. I hope this patch will be accepted, because it has 
> > > practical performance and efficiency benefits, and that this will 
> > > further increase the motivation to re-design the entire irq 
> > > defer(/suspend) infrastructure for per-napi settings.
> > 
> > Overall, the idea of keeping interrupts disabled during event
> > processing is very interesting.
> 
> Thanks; I'm happy to hear we are aligned on this.
> 
> > Hopefully the interface can be made more intuitive. Or documented more
> > easily. I had to read the kernel patches to fully (perhaps) grasp it.
> > 
> > Another +1 on the referenced paper. Pointing out a specific difference
> > in behavior that is unrelated to the protection domain, rather than a
> > straightforward kernel vs user argument. The paper also had some
> > explanation that may be clearer for a commit message than the current
> > cover letter:
> > 
> > "user-level network stacks put the application in charge of the entire
> > network stack processing (cf. Section 2). Interrupts are disabled and
> > the application coordinates execution by alternating between
> > processing existing requests and polling the RX queues for new data"
> > " [This series extends this behavior to kernel busy polling, while
> > falling back onto interrupt processing to limit CPU overhead.]
> > 
> > "Instead of re-enabling the respective interrupt(s) as soon as
> > epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay
> > masked until a subsequent epoll_wait() call comes up empty, i.e., no
> > events of interest are found and the application thread is about to be
> > blocked."
> > 
> > "A fallback technical approach would use a kernel timeout set on the
> > return path from epoll_wait(). If necessary, the timeout re-enables
> > > interrupts regardless of the application's (mis)behaviour."
> > [Where misbehavior is not calling epoll_wait again]
> > 
> > "The resulting execution model mimics the execution model of typical
> > user-level network stacks and does not add any requirements compared
> > to user-level networking. In fact, it is slightly better, because it
> > can resort to blocking and interrupt delivery, instead of having to
> > continuously busyloop during idle times."
> >
> > This last part shows a preference on your part to a trade-off:
> > you want low latency, but also low cpu utilization if possible.
> > This also came up in this thread. Please state that design decision
> > explicitly.
> 
> Sure, we can include that in the list of cover letter updates we
> need to make. I could have called it out more clearly, but in the cover
> letter [1] I mentioned that latency improved for compared CPU usage (i.e.
> CPU efficiency improved):
> 
>   The overall takeaway from the results below is that the new mechanism
>   (suspend20, see below) results in reduced 99th percentile latency and
>   increased QPS in the MAX QPS case (compared to the other cases), and
>   reduced latency in the lower QPS cases for comparable CPU usage to the
>   base case (and less CPU than the busy case).
> 
> > There are plenty of workloads where burning a core is acceptable
> > (especially as core count continues increasing), not "slightly worse".
> 
> Respectfully, I don't think I'm on board with this argument. "Burning a core"
> has side effects even as core counts increase (power, cooling, etc) and it
> seems the counter argument is equally valid, as well: there are plenty of
> workloads where burning a core is undesirable.

Likely even more, as this might be usable for standard efficiency-focused
production services.

> Using less CPU to get comparable performance is strictly better, even if a
> system can theoretically support the increased CPU/power/cooling load.

If it is always a strict win, yes. But falling back onto interrupts
with standard moderation will not match busy polling in all cases.

Different solutions for different workloads. No need to stack rank
them. My request is just to be explicit about which design point this
chooses, and that the other design point (continuous busy polling) is
already addressed by existing Linux kernel busy polling.

> Either way: this is not an either/or. Adding support for the code we've
> proposed will be very beneficial for an important set of workloads without
> taking anything anyway.
> 
> - Joe
> 
> [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
Joe Damato Aug. 14, 2024, 3:46 p.m. UTC | #14
On Wed, Aug 14, 2024 at 11:08:45AM -0400, Willem de Bruijn wrote:
> Joe Damato wrote:

[...]

> > On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote:
> > Using less CPU to get comparable performance is strictly better, even if a
> > system can theoretically support the increased CPU/power/cooling load.
> 
> If it is always a strict win yes. But falling back onto interrupts
> with standard moderation will not match busy polling in all cases.
> 
> Different solutions for different workloads. No need to stack rank
> them. My request is just to be explicit which design point this
> chooses, and that the other design point (continuous busy polling) is
> already addressed in Linux kernel busypolling.

Sure, sounds good; we can fix that in the cover letter.

Thanks for taking a look.
Samiullah Khawaja Aug. 14, 2024, 7:53 p.m. UTC | #15
On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>
> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > On 08/12, Martin Karsten wrote:
> >> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>> Greetings:
>
> [snip]
>
> >>>>>>> Maybe expand more on what code paths are we trying to improve? Existing
> >>>>>>> busy polling code is not super readable, so would be nice to simplify
> >>>>>>> it a bit in the process (if possible) instead of adding one more tunable.
> >>>>>>
> >>>>>> There are essentially three possible loops for network processing:
> >>>>>>
> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality
> >>>>>>
> >>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme
> >>>>>> with the shortcomings described above
> >>>>>>
> >>>>>> 3) epoll -> busy-poll -> napi poll
> >>>>>>
> >>>>>> If a system is configured for 1), not much can be done, as it is difficult
> >>>>>> to interject anything into this loop without adding state and side effects.
> >>>>>> This is what we tried for the paper, but it ended up being a hack.
> >>>>>>
> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3)
> >>>>>> "wrestle" with each other for control. Injecting the larger
> >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour
> >>>>>> of Loop 3) and creates the nice pattern describe above.
> >>>>>
> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace
> >>>>> isn't fast enough to keep up with the timer, presumably? I wonder
> >>>>> if need to use this opportunity and do proper API as Joe hints in the
> >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on
> >>>>> this queue / napi_id and with this timeout". And then we can essentially make
> >>>>> gro_flush_timeout per queue (and avoid
> >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels
> >>>>> too hacky already :-(
> >>>>
> >>>> If someone would implement the necessary changes to make these parameters
> >>>> per-napi, this would improve things further, but note that the current
> >>>> proposal gives strong performance across a range of workloads, which is
> >>>> otherwise difficult to impossible to achieve.
> >>>
> >>> Let's see what other people have to say. But we tried to do a similar
> >>> setup at Google recently and getting all these parameters right
> >>> was not trivial. Joe's recent patch series to push some of these into
> >>> epoll context are a step in the right direction. It would be nice to
> >>> have more explicit interface to express busy poling preference for
> >>> the users vs chasing a bunch of global tunables and fighting against softirq
> >>> wakups.
> >>
> >> One of the goals of this patch set is to reduce parameter tuning and make
> >> the parameter setting independent of workload dynamics, so it should make
> >> things easier. This is of course notwithstanding that per-napi settings
> >> would be even better.
> >>
> >> If you are able to share more details of your previous experiments (here or
> >> off-list), I would be very interested.
> >
> > We went through a similar exercise of trying to get the tail latencies down.
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except
> > we went with a hard-coded napi_id argument instead of tracking) and trying to
> > get a workable set of budget/timeout/gro_flush. We were fine with burning all
> > cpu capacity we had and no sleep at all, so we ended up having a bunch
> > of special cases in epoll loop to avoid the sleep.
> >
> > But we were trying to make a different model work (the one you mention in the
> > paper as well) where the userspace busy-pollers are just running napi_poll
> > on one cpu and the actual work is consumed by the userspace on a different cpu.
> > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll
> > and another epoll fd with actual sockets for signaling).
> >
> > This mode has a different set of challenges with socket lock, socket rx
> > queue and the backlog processing :-(
>
> I agree. That model has challenges and is extremely difficult to tune right.
We noticed a similar issue when we were using the same thread to do
application work and also napi busy polling on it. A large gro_flush_timeout
would improve throughput under load but hurt latency at low load, and vice
versa.

We were actually thinking of having the gro_flush_timeout value change based
on whether napi is in busy_poll state, and resetting it to the default value
for softirq-driven mode when busy polling is stopped.

Since we didn't have cpu constraints, we had dedicated pollers polling napi
and separate threads polling for epoll events and doing application work.
>
> >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>> an individual queue or application to make sure that IRQ suspension is
> >>>> enabled/disabled right away when the state of the system changes from busy
> >>>> to idle and back.
> >>>
> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> >>> which is more like "this particular napi_poll call is user busy polling".
> >>
> >> Then either the application needs to be polling all the time (wasting cpu
> >> cycles) or latencies will be determined by the timeout.
But if I understand correctly, this means that if the application thread
that is supposed to do napi busy polling gets busy doing work on the new
data/events in userspace, napi polling will not be done until the
suspend_timeout triggers? Do you dispatch work to separate worker threads,
in userspace, from the thread that is doing epoll_wait?
> >>
> >> Only when switching back and forth between polling and interrupts is it
> >> possible to get low latencies across a large spectrum of offered loads
> >> without burning cpu cycles at 100%.
> >
> > Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > about ep_suspend_napi_irqs/napi_resume_irqs.
>
> Thanks for probing and double-checking everything! Feedback is important
> for us to properly document our proposal.
>
> > Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > disabling napi during busy polling is super useful, but it would still
> > be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > a per-napi basis imho.
I agree, this would allow each napi queue to tune itself based on
heuristics. But I think doing it through an epoll-independent interface
makes more sense, as Stan suggested earlier.
>
> Fingers crossed. I hope this patch will be accepted, because it has
> practical performance and efficiency benefits, and that this will
> further increase the motivation to re-design the entire irq
> defer(/suspend) infrastructure for per-napi settings.
>
> Thanks,
> Martin
>
>
Martin Karsten Aug. 14, 2024, 8:42 p.m. UTC | #16
On 2024-08-14 15:53, Samiullah Khawaja wrote:
> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>
>> On 2024-08-13 00:07, Stanislav Fomichev wrote:
>>> On 08/12, Martin Karsten wrote:
>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>>>> On 08/12, Martin Karsten wrote:
>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>>>> Greetings:

[snip]

>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>>>> an individual queue or application to make sure that IRQ suspension is
>>>>>> enabled/disabled right away when the state of the system changes from busy
>>>>>> to idle and back.
>>>>>
>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>>>> which is more like "this particular napi_poll call is user busy polling".
>>>>
>>>> Then either the application needs to be polling all the time (wasting cpu
>>>> cycles) or latencies will be determined by the timeout.
> But if I understand correctly, this means that if the application
> thread that is supposed
> to do napi busy polling gets busy doing work on the new data/events in
> userspace, napi polling
> will not be done until the suspend_timeout triggers? Do you dispatch
> work to a separate worker
> threads, in userspace, from the thread that is doing epoll_wait?

Yes, napi polling is suspended while the application is busy between 
epoll_wait calls. That's where the benefits are coming from.

The consequences depend on the nature of the application and overall 
preferences for the system. If there's a "dominant" application for a 
number of queues and cores, the resulting latency for other background 
applications using the same queues might not be a problem at all.

One other simple mitigation is limiting the number of events that each 
epoll_wait call accepts. Note that this batch size also determines the 
worst-case latency for the application in question, so there is a 
natural incentive to keep it limited.

A more complex application design, like you suggest, might also be an 
option.
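
For illustration, a minimal userspace sketch of this usage pattern is below
(assuming the epoll_params layout from the earlier EPIOCSPARAMS patch; the
guarded definitions are only needed if the system headers do not provide
them yet, and the values are examples rather than tuning advice):

	#include <stdint.h>
	#include <stdio.h>
	#include <sys/epoll.h>
	#include <sys/ioctl.h>

	#ifndef EPIOCSPARAMS
	struct epoll_params {
		uint32_t busy_poll_usecs;
		uint16_t busy_poll_budget;
		uint8_t  prefer_busy_poll;
		uint8_t  __pad;
	};
	#define EPOLL_IOC_TYPE 0x8A
	#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
	#endif

	int main(void)
	{
		struct epoll_params p = {
			.busy_poll_usecs  = 0,  /* rely on suspension, not a
						 * fixed busy-poll window */
			.busy_poll_budget = 64,
			.prefer_busy_poll = 1,
		};
		struct epoll_event evs[8]; /* small batch bounds worst-case
					    * latency */
		int epfd = epoll_create1(0);

		if (epfd < 0 || ioctl(epfd, EPIOCSPARAMS, &p) < 0) {
			perror("epoll busy-poll setup");
			return 1;
		}

		/* add sockets with epoll_ctl(epfd, EPOLL_CTL_ADD, ...) here */

		for (;;) {
			int n = epoll_wait(epfd, evs, 8, -1);

			if (n < 0)
				break;
			/* process up to 8 events; IRQs stay suspended as long
			 * as successive epoll_wait() calls return events */
		}
		return 0;
	}

With a setup like this, the batch size of 8 caps how much userspace work can
pile up between polls, which also bounds the latency that other users of the
same queue may see while IRQs are suspended.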

>>>> Only when switching back and forth between polling and interrupts is it
>>>> possible to get low latencies across a large spectrum of offered loads
>>>> without burning cpu cycles at 100%.
>>>
>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
>>> about ep_suspend_napi_irqs/napi_resume_irqs.
>>
>> Thanks for probing and double-checking everything! Feedback is important
>> for us to properly document our proposal.
>>
>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
>>> disabling napi during busy polling is super useful, but it would still
>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
>>> a per-napi basis imho.
> I agree, this would allow each napi queue to tune itself based on
> heuristics. But I think
> doing it through epoll independent interface makes more sense as Stan
> suggested earlier.

The question is whether to add a useful mechanism (one sysfs parameter 
and a few lines of code) that is optional, but with demonstrable and 
significant performance/efficiency improvements for an important class 
of applications - or wait for an uncertain future?

Note that adding our mechanism in no way precludes switching the control 
parameters from per-device to per-napi as Joe alluded to earlier. In 
fact, it increases the incentive for doing so.

After working on this for quite a while, I am skeptical that anything 
fundamentally different could be done without re-architecting the entire 
napi control flow.

Thanks,
Martin
Willem de Bruijn Aug. 16, 2024, 2:27 p.m. UTC | #17
Martin Karsten wrote:
> On 2024-08-14 15:53, Samiullah Khawaja wrote:
> > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>
> >> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> >>> On 08/12, Martin Karsten wrote:
> >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>>>> On 08/12, Martin Karsten wrote:
> >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>>>> On 08/12, Martin Karsten wrote:
> >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>>>> Greetings:
> 
> [snip]
> 
> >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>>>> an individual queue or application to make sure that IRQ suspension is
> >>>>>> enabled/disabled right away when the state of the system changes from busy
> >>>>>> to idle and back.
> >>>>>
> >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> >>>>> which is more like "this particular napi_poll call is user busy polling".
> >>>>
> >>>> Then either the application needs to be polling all the time (wasting cpu
> >>>> cycles) or latencies will be determined by the timeout.
> > But if I understand correctly, this means that if the application
> > thread that is supposed
> > to do napi busy polling gets busy doing work on the new data/events in
> > userspace, napi polling
> > will not be done until the suspend_timeout triggers? Do you dispatch
> > work to a separate worker
> > threads, in userspace, from the thread that is doing epoll_wait?
> 
> Yes, napi polling is suspended while the application is busy between 
> epoll_wait calls. That's where the benefits are coming from.
> 
> The consequences depend on the nature of the application and overall 
> preferences for the system. If there's a "dominant" application for a 
> number of queues and cores, the resulting latency for other background 
> applications using the same queues might not be a problem at all.
> 
> One other simple mitigation is limiting the number of events that each 
> epoll_wait call accepts. Note that this batch size also determines the 
> worst-case latency for the application in question, so there is a 
> natural incentive to keep it limited.
> 
> A more complex application design, like you suggest, might also be an 
> option.
> 
> >>>> Only when switching back and forth between polling and interrupts is it
> >>>> possible to get low latencies across a large spectrum of offered loads
> >>>> without burning cpu cycles at 100%.
> >>>
> >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> >>> about ep_suspend_napi_irqs/napi_resume_irqs.
> >>
> >> Thanks for probing and double-checking everything! Feedback is important
> >> for us to properly document our proposal.
> >>
> >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> >>> disabling napi during busy polling is super useful, but it would still
> >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> >>> a per-napi basis imho.
> > I agree, this would allow each napi queue to tune itself based on
> > heuristics. But I think
> > doing it through epoll independent interface makes more sense as Stan
> > suggested earlier.
> 
> The question is whether to add a useful mechanism (one sysfs parameter 
> and a few lines of code) that is optional, but with demonstrable and 
> significant performance/efficiency improvements for an important class 
> of applications - or wait for an uncertain future?

The issue is that this one little change can never be removed, as it
becomes ABI.

Let's get the right API from the start.

Not sure that a global variable, or sysfs as API, is the right one.
 
> Note that adding our mechanism in no way precludes switching the control 
> parameters from per-device to per-napi as Joe alluded to earlier. In 
> fact, it increases the incentive for doing so.
> 
> After working on this for quite a while, I am skeptical that anything 
> fundamentally different could be done without re-architecting the entire 
> napi control flow.
> 
> Thanks,
> Martin
>
Willem de Bruijn Aug. 16, 2024, 2:59 p.m. UTC | #18
Willem de Bruijn wrote:
> Martin Karsten wrote:
> > On 2024-08-14 15:53, Samiullah Khawaja wrote:
> > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > >>
> > >> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > >>> On 08/12, Martin Karsten wrote:
> > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > >>>>> On 08/12, Martin Karsten wrote:
> > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > >>>>>>> On 08/12, Martin Karsten wrote:
> > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > >>>>>>>>> On 08/12, Joe Damato wrote:
> > >>>>>>>>>> Greetings:
> > 
> > [snip]
> > 
> > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> > >>>>>> an individual queue or application to make sure that IRQ suspension is
> > >>>>>> enabled/disabled right away when the state of the system changes from busy
> > >>>>>> to idle and back.
> > >>>>>
> > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
> > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> > >>>>> which is more like "this particular napi_poll call is user busy polling".
> > >>>>
> > >>>> Then either the application needs to be polling all the time (wasting cpu
> > >>>> cycles) or latencies will be determined by the timeout.
> > > But if I understand correctly, this means that if the application
> > > thread that is supposed
> > > to do napi busy polling gets busy doing work on the new data/events in
> > > userspace, napi polling
> > > will not be done until the suspend_timeout triggers? Do you dispatch
> > > work to a separate worker
> > > threads, in userspace, from the thread that is doing epoll_wait?
> > 
> > Yes, napi polling is suspended while the application is busy between 
> > epoll_wait calls. That's where the benefits are coming from.
> > 
> > The consequences depend on the nature of the application and overall 
> > preferences for the system. If there's a "dominant" application for a 
> > number of queues and cores, the resulting latency for other background 
> > applications using the same queues might not be a problem at all.
> > 
> > One other simple mitigation is limiting the number of events that each 
> > epoll_wait call accepts. Note that this batch size also determines the 
> > worst-case latency for the application in question, so there is a 
> > natural incentive to keep it limited.
> > 
> > A more complex application design, like you suggest, might also be an 
> > option.
> > 
> > >>>> Only when switching back and forth between polling and interrupts is it
> > >>>> possible to get low latencies across a large spectrum of offered loads
> > >>>> without burning cpu cycles at 100%.
> > >>>
> > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > >>> about ep_suspend_napi_irqs/napi_resume_irqs.
> > >>
> > >> Thanks for probing and double-checking everything! Feedback is important
> > >> for us to properly document our proposal.
> > >>
> > >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > >>> disabling napi during busy polling is super useful, but it would still
> > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > >>> a per-napi basis imho.
> > > I agree, this would allow each napi queue to tune itself based on
> > > heuristics. But I think
> > > doing it through epoll independent interface makes more sense as Stan
> > > suggested earlier.
> > 
> > The question is whether to add a useful mechanism (one sysfs parameter 
> > and a few lines of code) that is optional, but with demonstrable and 
> > significant performance/efficiency improvements for an important class 
> > of applications - or wait for an uncertain future?
> 
> The issue is that this one little change can never be removed, as it
> becomes ABI.
> 
> Let's get the right API from the start.
> 
> Not sure that a global variable, or sysfs as API, is the right one.

Sorry, per-device, not global.

My main concern is that it adds yet another user tunable integer, for
which the right value is not obvious.

If the only goal is to safely reenable interrupts when the application
stops calling epoll_wait, does this have to be user tunable?

Can it be either a single good-enough constant, or derived from
another tunable, like busypoll_read?
Joe Damato Aug. 16, 2024, 3:25 p.m. UTC | #19
On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
> Willem de Bruijn wrote:
> > Martin Karsten wrote:
> > > On 2024-08-14 15:53, Samiullah Khawaja wrote:
> > > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > >>
> > > >> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > >>> On 08/12, Martin Karsten wrote:
> > > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > > >>>>> On 08/12, Martin Karsten wrote:
> > > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > >>>>>>> On 08/12, Martin Karsten wrote:
> > > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > >>>>>>>>> On 08/12, Joe Damato wrote:
> > > >>>>>>>>>> Greetings:
> > > 
> > > [snip]
> > > 
> > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> > > >>>>>> an individual queue or application to make sure that IRQ suspension is
> > > >>>>>> enabled/disabled right away when the state of the system changes from busy
> > > >>>>>> to idle and back.
> > > >>>>>
> > > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> > > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
> > > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> > > >>>>> which is more like "this particular napi_poll call is user busy polling".
> > > >>>>
> > > >>>> Then either the application needs to be polling all the time (wasting cpu
> > > >>>> cycles) or latencies will be determined by the timeout.
> > > > But if I understand correctly, this means that if the application
> > > > thread that is supposed
> > > > to do napi busy polling gets busy doing work on the new data/events in
> > > > userspace, napi polling
> > > > will not be done until the suspend_timeout triggers? Do you dispatch
> > > > work to a separate worker
> > > > threads, in userspace, from the thread that is doing epoll_wait?
> > > 
> > > Yes, napi polling is suspended while the application is busy between 
> > > epoll_wait calls. That's where the benefits are coming from.
> > > 
> > > The consequences depend on the nature of the application and overall 
> > > preferences for the system. If there's a "dominant" application for a 
> > > number of queues and cores, the resulting latency for other background 
> > > applications using the same queues might not be a problem at all.
> > > 
> > > One other simple mitigation is limiting the number of events that each 
> > > epoll_wait call accepts. Note that this batch size also determines the 
> > > worst-case latency for the application in question, so there is a 
> > > natural incentive to keep it limited.
> > > 
> > > A more complex application design, like you suggest, might also be an 
> > > option.
> > > 
> > > >>>> Only when switching back and forth between polling and interrupts is it
> > > >>>> possible to get low latencies across a large spectrum of offered loads
> > > >>>> without burning cpu cycles at 100%.
> > > >>>
> > > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > > >>> about ep_suspend_napi_irqs/napi_resume_irqs.
> > > >>
> > > >> Thanks for probing and double-checking everything! Feedback is important
> > > >> for us to properly document our proposal.
> > > >>
> > > >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > >>> disabling napi during busy polling is super useful, but it would still
> > > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > >>> a per-napi basis imho.
> > > > I agree, this would allow each napi queue to tune itself based on
> > > > heuristics. But I think
> > > > doing it through epoll independent interface makes more sense as Stan
> > > > suggested earlier.
> > > 
> > > The question is whether to add a useful mechanism (one sysfs parameter 
> > > and a few lines of code) that is optional, but with demonstrable and 
> > > significant performance/efficiency improvements for an important class 
> > > of applications - or wait for an uncertain future?
> > 
> > The issue is that this one little change can never be removed, as it
> > becomes ABI.
> > 
> > Let's get the right API from the start.
> > 
> > Not sure that a global variable, or sysfs as API, is the right one.
> 
> Sorry per-device, not global.
> 
> My main concern is that it adds yet another user tunable integer, for
> which the right value is not obvious.

This is a feature for advanced users just like SO_INCOMING_NAPI_ID
and countless other features.

The value may not be obvious, but guidance (in the form of
documentation) can be provided.

> If the only goal is to safely reenable interrupts when the application
> stops calling epoll_wait, does this have to be user tunable?
> 
> Can it be either a single good enough constant, or derived from
> another tunable, like busypoll_read.

I believe you meant busy_read here, is that right?

At any rate:

  - I don't think a single constant is appropriate, just as it
    wasn't appropriate for the existing mechanism
    (napi_defer_hard_irqs/gro_flush_timeout), and

  - Deriving the value from a pre-existing parameter to preserve the
    ABI, like busy_read, makes using this more confusing for users
    and complicates the API significantly.

I agree we should get the API right from the start; that's why we've
submitted this as an RFC ;)

We are happy to take suggestions from the community, but, IMHO,
re-using an existing parameter for a different purpose only in
certain circumstances (if I understand your suggestions) is a much
worse choice than adding a new tunable that clearly states its
intended singular purpose.

- Joe
Willem de Bruijn Aug. 16, 2024, 5:01 p.m. UTC | #20
Joe Damato wrote:
> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
> > Willem de Bruijn wrote:
> > > Martin Karsten wrote:
> > > > On 2024-08-14 15:53, Samiullah Khawaja wrote:
> > > > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> > > > >>
> > > > >> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> > > > >>> On 08/12, Martin Karsten wrote:
> > > > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> > > > >>>>> On 08/12, Martin Karsten wrote:
> > > > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> > > > >>>>>>> On 08/12, Martin Karsten wrote:
> > > > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> > > > >>>>>>>>> On 08/12, Joe Damato wrote:
> > > > >>>>>>>>>> Greetings:
> > > > 
> > > > [snip]
> > > > 
> > > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> > > > >>>>>> an individual queue or application to make sure that IRQ suspension is
> > > > >>>>>> enabled/disabled right away when the state of the system changes from busy
> > > > >>>>>> to idle and back.
> > > > >>>>>
> > > > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> > > > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
> > > > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> > > > >>>>> which is more like "this particular napi_poll call is user busy polling".
> > > > >>>>
> > > > >>>> Then either the application needs to be polling all the time (wasting cpu
> > > > >>>> cycles) or latencies will be determined by the timeout.
> > > > > But if I understand correctly, this means that if the application
> > > > > thread that is supposed
> > > > > to do napi busy polling gets busy doing work on the new data/events in
> > > > > userspace, napi polling
> > > > > will not be done until the suspend_timeout triggers? Do you dispatch
> > > > > work to a separate worker
> > > > > threads, in userspace, from the thread that is doing epoll_wait?
> > > > 
> > > > Yes, napi polling is suspended while the application is busy between 
> > > > epoll_wait calls. That's where the benefits are coming from.
> > > > 
> > > > The consequences depend on the nature of the application and overall 
> > > > preferences for the system. If there's a "dominant" application for a 
> > > > number of queues and cores, the resulting latency for other background 
> > > > applications using the same queues might not be a problem at all.
> > > > 
> > > > One other simple mitigation is limiting the number of events that each 
> > > > epoll_wait call accepts. Note that this batch size also determines the 
> > > > worst-case latency for the application in question, so there is a 
> > > > natural incentive to keep it limited.
> > > > 
> > > > A more complex application design, like you suggest, might also be an 
> > > > option.
> > > > 
> > > > >>>> Only when switching back and forth between polling and interrupts is it
> > > > >>>> possible to get low latencies across a large spectrum of offered loads
> > > > >>>> without burning cpu cycles at 100%.
> > > > >>>
> > > > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> > > > >>> about ep_suspend_napi_irqs/napi_resume_irqs.
> > > > >>
> > > > >> Thanks for probing and double-checking everything! Feedback is important
> > > > >> for us to properly document our proposal.
> > > > >>
> > > > >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> > > > >>> disabling napi during busy polling is super useful, but it would still
> > > > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> > > > >>> a per-napi basis imho.
> > > > > I agree, this would allow each napi queue to tune itself based on
> > > > > heuristics. But I think
> > > > > doing it through epoll independent interface makes more sense as Stan
> > > > > suggested earlier.
> > > > 
> > > > The question is whether to add a useful mechanism (one sysfs parameter 
> > > > and a few lines of code) that is optional, but with demonstrable and 
> > > > significant performance/efficiency improvements for an important class 
> > > > of applications - or wait for an uncertain future?
> > > 
> > > The issue is that this one little change can never be removed, as it
> > > becomes ABI.
> > > 
> > > Let's get the right API from the start.
> > > 
> > > Not sure that a global variable, or sysfs as API, is the right one.
> > 
> > Sorry per-device, not global.
> > 
> > My main concern is that it adds yet another user tunable integer, for
> > which the right value is not obvious.
> 
> This is a feature for advanced users just like SO_INCOMING_NAPI_ID
> and countless other features.
> 
> The value may not be obvious, but guidance (in the form of
> documentation) can be provided.

Okay. Could you share a stab at what that would look like?

> > If the only goal is to safely reenable interrupts when the application
> > stops calling epoll_wait, does this have to be user tunable?
> > 
> > Can it be either a single good enough constant, or derived from
> > another tunable, like busypoll_read.
> 
> I believe you meant busy_read here, is that right?
> 
> At any rate:
> 
>   - I don't think a single constant is appropriate, just as it
>     wasn't appropriate for the existing mechanism
>     (napi_defer_hard_irqs/gro_flush_timeout), and
> 
>   - Deriving the value from a pre-existing parameter to preserve the
>     ABI, like busy_read, makes using this more confusing for users
>     and complicates the API significantly.
> 
> I agree we should get the API right from the start; that's why we've
> > submitted this as an RFC ;)
> 
> We are happy to take suggestions from the community, but, IMHO,
> re-using an existing parameter for a different purpose only in
> certain circumstances (if I understand your suggestions) is a much
> worse choice than adding a new tunable that clearly states its
> intended singular purpose.

Ack. I was thinking whether an epoll flag through your new epoll
ioctl interface to toggle the IRQ suspension (and timer start)
would be preferable. Because more fine grained.

Also, the value is likely dependent more on the expected duration
of userspace processing? If so, it would be the same for all
devices, so does a per-netdev value make sense?
Martin Karsten Aug. 16, 2024, 8:03 p.m. UTC | #21
On 2024-08-16 13:01, Willem de Bruijn wrote:
> Joe Damato wrote:
>> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
>>> Willem de Bruijn wrote:
>>>> Martin Karsten wrote:
>>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote:
>>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>>>>
>>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote:
>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>>>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>>>>>>>>> Greetings:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is
>>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy
>>>>>>>>>>> to idle and back.
>>>>>>>>>>
>>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>>>>>>>>> which is more like "this particular napi_poll call is user busy polling".
>>>>>>>>>
>>>>>>>>> Then either the application needs to be polling all the time (wasting cpu
>>>>>>>>> cycles) or latencies will be determined by the timeout.
>>>>>> But if I understand correctly, this means that if the application
>>>>>> thread that is supposed
>>>>>> to do napi busy polling gets busy doing work on the new data/events in
>>>>>> userspace, napi polling
>>>>>> will not be done until the suspend_timeout triggers? Do you dispatch
>>>>>> work to a separate worker
>>>>>> threads, in userspace, from the thread that is doing epoll_wait?
>>>>>
>>>>> Yes, napi polling is suspended while the application is busy between
>>>>> epoll_wait calls. That's where the benefits are coming from.
>>>>>
>>>>> The consequences depend on the nature of the application and overall
>>>>> preferences for the system. If there's a "dominant" application for a
>>>>> number of queues and cores, the resulting latency for other background
>>>>> applications using the same queues might not be a problem at all.
>>>>>
>>>>> One other simple mitigation is limiting the number of events that each
>>>>> epoll_wait call accepts. Note that this batch size also determines the
>>>>> worst-case latency for the application in question, so there is a
>>>>> natural incentive to keep it limited.
>>>>>
>>>>> A more complex application design, like you suggest, might also be an
>>>>> option.
>>>>>
>>>>>>>>> Only when switching back and forth between polling and interrupts is it
>>>>>>>>> possible to get low latencies across a large spectrum of offered loads
>>>>>>>>> without burning cpu cycles at 100%.
>>>>>>>>
>>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
>>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs.
>>>>>>>
>>>>>>> Thanks for probing and double-checking everything! Feedback is important
>>>>>>> for us to properly document our proposal.
>>>>>>>
>>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
>>>>>>>> disabling napi during busy polling is super useful, but it would still
>>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
>>>>>>>> a per-napi basis imho.
>>>>>> I agree, this would allow each napi queue to tune itself based on
>>>>>> heuristics. But I think
>>>>>> doing it through epoll independent interface makes more sense as Stan
>>>>>> suggested earlier.
>>>>>
>>>>> The question is whether to add a useful mechanism (one sysfs parameter
>>>>> and a few lines of code) that is optional, but with demonstrable and
>>>>> significant performance/efficiency improvements for an important class
>>>>> of applications - or wait for an uncertain future?
>>>>
>>>> The issue is that this one little change can never be removed, as it
>>>> becomes ABI.
>>>>
>>>> Let's get the right API from the start.
>>>>
>>>> Not sure that a global variable, or sysfs as API, is the right one.
>>>
>>> Sorry per-device, not global.
>>>
>>> My main concern is that it adds yet another user tunable integer, for
>>> which the right value is not obvious.
>>
>> This is a feature for advanced users just like SO_INCOMING_NAPI_ID
>> and countless other features.
>>
>> The value may not be obvious, but guidance (in the form of
>> documentation) can be provided.
> 
> Okay. Could you share a stab at what that would look like?

The timeout needs to be large enough that an application can get a 
meaningful number of incoming requests processed without softirq 
interference. At the same time, the timeout value determines the 
worst-case delivery delay that a concurrent application using the same 
queue(s) might experience. Please also see my response to Samiullah 
quoted above. The specific circumstances and trade-offs might vary, 
that's why a simple constant likely won't do.
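
As a rough illustration of what such guidance might look like, here is a
sketch of an admin-side setup (not part of this series; the device name
and numbers are purely illustrative, both timeouts are assumed to be in
nanoseconds, and irq_suspend_timeout is the knob proposed here rather
than an existing upstream file):

/* Hedged sketch of an admin-side helper applying the guidance above. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* small deferral window, roughly one request service time (20 us) */
	write_sysfs("/sys/class/net/eth0/napi_defer_hard_irqs", "2");
	write_sysfs("/sys/class/net/eth0/gro_flush_timeout", "20000");
	/* long suspend window, covering a whole epoll_wait batch (20 ms) */
	write_sysfs("/sys/class/net/eth0/irq_suspend_timeout", "20000000");
	return 0;
}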

>>> If the only goal is to safely reenable interrupts when the application
>>> stops calling epoll_wait, does this have to be user tunable?
>>>
>>> Can it be either a single good enough constant, or derived from
>>> another tunable, like busypoll_read.
>>
>> I believe you meant busy_read here, is that right?
>>
>> At any rate:
>>
>>    - I don't think a single constant is appropriate, just as it
>>      wasn't appropriate for the existing mechanism
>>      (napi_defer_hard_irqs/gro_flush_timeout), and
>>
>>    - Deriving the value from a pre-existing parameter to preserve the
>>      ABI, like busy_read, makes using this more confusing for users
>>      and complicates the API significantly.
>>
>> I agree we should get the API right from the start; that's why we've
>> submitted this as an RFC ;)
>>
>> We are happy to take suggestions from the community, but, IMHO,
>> re-using an existing parameter for a different purpose only in
>> certain circumstances (if I understand your suggestions) is a much
>> worse choice than adding a new tunable that clearly states its
>> intended singular purpose.
> 
> Ack. I was thinking whether an epoll flag through your new epoll
> ioctl interface to toggle the IRQ suspension (and timer start)
> would be preferable. Because more fine grained.

A value provided by an application through the epoll ioctl would not be 
subject to admin oversight, so a misbehaving application could set an 
arbitrary timeout value. A sysfs value needs to be set by an admin. The 
ideal timeout value depends both on the particular target application as 
well as concurrent applications using the same queue(s) - as sketched above.

> Also, the value is likely dependent more on the expected duration
> of userspace processing? If so, it would be the same for all
> devices, so does a per-netdev value make sense?

It is per-netdev in the current proposal to be at the same granularity 
as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension 
operates at the same level/granularity. This allows for more control 
than a global setting and it can be migrated to per-napi settings along 
with gro_flush_timeout and napi_defer_hard_irqs when the time comes.

Thanks,
Martin
Willem de Bruijn Aug. 16, 2024, 8:58 p.m. UTC | #22
Martin Karsten wrote:
> On 2024-08-16 13:01, Willem de Bruijn wrote:
> > Joe Damato wrote:
> >> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
> >>> Willem de Bruijn wrote:
> >>>> Martin Karsten wrote:
> >>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote:
> >>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
> >>>>>>>
> >>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote:
> >>>>>>>> On 08/12, Martin Karsten wrote:
> >>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
> >>>>>>>>>> On 08/12, Martin Karsten wrote:
> >>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
> >>>>>>>>>>>> On 08/12, Martin Karsten wrote:
> >>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
> >>>>>>>>>>>>>> On 08/12, Joe Damato wrote:
> >>>>>>>>>>>>>>> Greetings:
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
> >>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is
> >>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy
> >>>>>>>>>>> to idle and back.
> >>>>>>>>>>
> >>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
> >>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
> >>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
> >>>>>>>>>> which is more like "this particular napi_poll call is user busy polling".
> >>>>>>>>>
> >>>>>>>>> Then either the application needs to be polling all the time (wasting cpu
> >>>>>>>>> cycles) or latencies will be determined by the timeout.
> >>>>>> But if I understand correctly, this means that if the application
> >>>>>> thread that is supposed
> >>>>>> to do napi busy polling gets busy doing work on the new data/events in
> >>>>>> userspace, napi polling
> >>>>>> will not be done until the suspend_timeout triggers? Do you dispatch
> >>>>>> work to a separate worker
> >>>>>> threads, in userspace, from the thread that is doing epoll_wait?
> >>>>>
> >>>>> Yes, napi polling is suspended while the application is busy between
> >>>>> epoll_wait calls. That's where the benefits are coming from.
> >>>>>
> >>>>> The consequences depend on the nature of the application and overall
> >>>>> preferences for the system. If there's a "dominant" application for a
> >>>>> number of queues and cores, the resulting latency for other background
> >>>>> applications using the same queues might not be a problem at all.
> >>>>>
> >>>>> One other simple mitigation is limiting the number of events that each
> >>>>> epoll_wait call accepts. Note that this batch size also determines the
> >>>>> worst-case latency for the application in question, so there is a
> >>>>> natural incentive to keep it limited.
> >>>>>
> >>>>> A more complex application design, like you suggest, might also be an
> >>>>> option.
> >>>>>
> >>>>>>>>> Only when switching back and forth between polling and interrupts is it
> >>>>>>>>> possible to get low latencies across a large spectrum of offered loads
> >>>>>>>>> without burning cpu cycles at 100%.
> >>>>>>>>
> >>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
> >>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs.
> >>>>>>>
> >>>>>>> Thanks for probing and double-checking everything! Feedback is important
> >>>>>>> for us to properly document our proposal.
> >>>>>>>
> >>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
> >>>>>>>> disabling napi during busy polling is super useful, but it would still
> >>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
> >>>>>>>> a per-napi basis imho.
> >>>>>> I agree, this would allow each napi queue to tune itself based on
> >>>>>> heuristics. But I think
> >>>>>> doing it through epoll independent interface makes more sense as Stan
> >>>>>> suggested earlier.
> >>>>>
> >>>>> The question is whether to add a useful mechanism (one sysfs parameter
> >>>>> and a few lines of code) that is optional, but with demonstrable and
> >>>>> significant performance/efficiency improvements for an important class
> >>>>> of applications - or wait for an uncertain future?
> >>>>
> >>>> The issue is that this one little change can never be removed, as it
> >>>> becomes ABI.
> >>>>
> >>>> Let's get the right API from the start.
> >>>>
> >>>> Not sure that a global variable, or sysfs as API, is the right one.
> >>>
> >>> Sorry per-device, not global.
> >>>
> >>> My main concern is that it adds yet another user tunable integer, for
> >>> which the right value is not obvious.
> >>
> >> This is a feature for advanced users just like SO_INCOMING_NAPI_ID
> >> and countless other features.
> >>
> >> The value may not be obvious, but guidance (in the form of
> >> documentation) can be provided.
> > 
> > Okay. Could you share a stab at what that would look like?
> 
> The timeout needs to be large enough that an application can get a 
> meaningful number of incoming requests processed without softirq 
> interference. At the same time, the timeout value determines the 
> worst-case delivery delay that a concurrent application using the same 
> queue(s) might experience. Please also see my response to Samiullah 
> quoted above. The specific circumstances and trade-offs might vary, 
> that's why a simple constant likely won't do.

Thanks. I really do mean this as an exercise of what documentation in
Documentation/networking/napi.rst will look like. That helps make the
case that the interface is reasonably easy to use (even if only
targeting advanced users).

How does a user measure how much time a process will spend on
processing a meaningful number of incoming requests, for instance.
In practice, probably just a hunch?

Playing devil's advocate some more: given that ethtool usecs have to
be chosen with a similar trade-off between latency and efficiency,
could a multiplicative factor of this (or gro_flush_timeout, same
thing) be sufficient and easier to choose? The documentation does
state that the value chosen must be >= gro_flush_timeout.
 
> >>> If the only goal is to safely reenable interrupts when the application
> >>> stops calling epoll_wait, does this have to be user tunable?
> >>>
> >>> Can it be either a single good enough constant, or derived from
> >>> another tunable, like busypoll_read.
> >>
> >> I believe you meant busy_read here, is that right?
> >>
> >> At any rate:
> >>
> >>    - I don't think a single constant is appropriate, just as it
> >>      wasn't appropriate for the existing mechanism
> >>      (napi_defer_hard_irqs/gro_flush_timeout), and
> >>
> >>    - Deriving the value from a pre-existing parameter to preserve the
> >>      ABI, like busy_read, makes using this more confusing for users
> >>      and complicates the API significantly.
> >>
> >> I agree we should get the API right from the start; that's why we've
> >> submitted this as an RFC ;)
> >>
> >> We are happy to take suggestions from the community, but, IMHO,
> >> re-using an existing parameter for a different purpose only in
> >> certain circumstances (if I understand your suggestions) is a much
> >> worse choice than adding a new tunable that clearly states its
> >> intended singular purpose.
> > 
> > Ack. I was thinking whether an epoll flag through your new epoll
> > ioctl interface to toggle the IRQ suspension (and timer start)
> > would be preferable. Because more fine grained.
> 
> A value provided by an application through the epoll ioctl would not be 
> subject to admin oversight, so a misbehaving application could set an 
> arbitrary timeout value. A sysfs value needs to be set by an admin. The 
> ideal timeout value depends both on the particular target application as 
> well as concurrent applications using the same queue(s) - as sketched above.

I meant setting the value systemwide (or per-device), but opting in to
the feature via a binary epoll option. Really an epoll_wait flag, if we
had flags.

Any admin privileged operations can also be protected at the epoll
level by requiring CAP_NET_ADMIN too, of course. But fair point that
this might operate in a multi-process environment, so values should
not be hardcoded into the binaries.

Just asking questions to explore the option space so as not to settle
on an API too soon. Given that, as said, we cannot remove it later.

> > Also, the value is likely dependent more on the expected duration
> > of userspace processing? If so, it would be the same for all
> > devices, so does a per-netdev value make sense?
> 
> It is per-netdev in the current proposal to be at the same granularity 
> as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension 
> operates at the same level/granularity. This allows for more control 
> than a global setting and it can be migrated to per-napi settings along 
> with gro_flush_timeout and napi_defer_hard_irqs when the time comes.

Ack, makes sense. Many of these design choices and their rationale are
good to explicitly capture in the commit message.
Joe Damato Aug. 17, 2024, 10 a.m. UTC | #23
On Fri, Aug 16, 2024 at 01:01:42PM -0400, Willem de Bruijn wrote:
> Joe Damato wrote:
> > On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
[...]
> > > Willem de Bruijn wrote:
> > > If the only goal is to safely reenable interrupts when the application
> > > stops calling epoll_wait, does this have to be user tunable?
> > > 
> > > Can it be either a single good enough constant, or derived from
> > > another tunable, like busypoll_read.
> > 
> > I believe you meant busy_read here, is that right?
> > 
> > At any rate:
> > 
> >   - I don't think a single constant is appropriate, just as it
> >     wasn't appropriate for the existing mechanism
> >     (napi_defer_hard_irqs/gro_flush_timeout), and
> > 
> >   - Deriving the value from a pre-existing parameter to preserve the
> >     ABI, like busy_read, makes using this more confusing for users
> >     and complicates the API significantly.
> > 
> > I agree we should get the API right from the start; that's why we've
> > submitted this as an RFC ;)
> > 
> > We are happy to take suggestions from the community, but, IMHO,
> > re-using an existing parameter for a different purpose only in
> > certain circumstances (if I understand your suggestions) is a much
> > worse choice than adding a new tunable that clearly states its
> > intended singular purpose.
> 
> Ack. I was thinking whether an epoll flag through your new epoll
> ioctl interface to toggle the IRQ suspension (and timer start)
> would be preferable. Because more fine grained.

I understand why you are asking about this and I think it would be
great if this were possible, but unfortunately it isn't.

epoll contexts can be associated with any NAPI ID, but the IRQ
suspension is NAPI specific.

As an example: imagine a user program creates an epoll context and
adds fds with NAPI ID 1111 to the context. It then issues the ioctl
to set the suspend timeout for that context. Then, for whatever
reason, the user app decides to remove all the fds and add new ones,
this time from NAPI ID 2222, which happens to be a different
net_device.

What does that mean for the suspend timeout? It's not clear to me
what the right behavior would be in this situation (does it persist?
does it get cleared when a new NAPI ID is added? etc) and it makes
the user API much more complicated, with many more edge cases and
possible bugs.
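
To make the layering concrete, here is a minimal userspace sketch
(illustration only, not code from this series; the fallback struct
layout and ioctl number mirror the eventpoll uapi as I understand it):
the busy-poll preference is attached to the epoll context itself, while
NAPI associations only come from whichever fds happen to be added.

/* Illustration only: epoll params are per-context, NAPI IDs are per-fd.
 * The fallback definitions below are only used if the system headers do
 * not already provide EPIOCSPARAMS.
 */
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
#include <linux/ioctl.h>
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t prefer_busy_poll;
	uint8_t __pad;			/* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

int enable_prefer_busy_poll(int epfd)
{
	struct epoll_params p;

	memset(&p, 0, sizeof(p));
	p.prefer_busy_poll = 1;		/* context-wide preference */

	/* Nothing here names a NAPI or a netdev. The fds added later may
	 * resolve to NAPI ID 1111 today and NAPI ID 2222 tomorrow.
	 */
	return ioctl(epfd, EPIOCSPARAMS, &p);
}

A program would call this once after epoll_create1() and before adding
fds; note that nothing in the call identifies a NAPI or a device, which
is exactly where a per-context suspend timeout would get ambiguous.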

> Also, the value is likely dependent more on the expected duration
> of userspace processing? If so, it would be the same for all
> devices, so does a per-netdev value make sense?

There is presently no way to set values like gro_timeout,
defer_hard_irqs, or this new proposed value on a per-NAPI basis.
IMHO, that is really where all of these values should live.

I mentioned on the list previously (and also in the cover letter),
that time permitting, I think the correct evolution of this would be
to support per-NAPI settings (via netdev-genl, I assume) so that
user programs can set all 3 values on only the NAPIs they care
about.

Until that functionality is available, it would seem per-netdev is
the only way for this feature to be added at the present time. I
simply haven't had the time to add the above interface. This
feature we're proposing has demonstrable performance value, but it
doesn't seem sensible to block it until time permits me to add a
per-NAPI interface for all of these values given that we already
globally expose 2 of them.

That said, I appreciate the thoughtfulness of your replies and I am
open to other suggestions.

- Joe
Martin Karsten Aug. 17, 2024, 6:15 p.m. UTC | #24
On 2024-08-16 16:58, Willem de Bruijn wrote:
> Martin Karsten wrote:
>> On 2024-08-16 13:01, Willem de Bruijn wrote:
>>> Joe Damato wrote:
>>>> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote:
>>>>> Willem de Bruijn wrote:
>>>>>> Martin Karsten wrote:
>>>>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote:
>>>>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote:
>>>>>>>>>
>>>>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote:
>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote:
>>>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote:
>>>>>>>>>>>>>> On 08/12, Martin Karsten wrote:
>>>>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote:
>>>>>>>>>>>>>>>> On 08/12, Joe Damato wrote:
>>>>>>>>>>>>>>>>> Greetings:
>>>>>>>
>>>>>>> [snip]
>>>>>>>
>>>>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of
>>>>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is
>>>>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy
>>>>>>>>>>>>> to idle and back.
>>>>>>>>>>>>
>>>>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi
>>>>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout",
>>>>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL
>>>>>>>>>>>> which is more like "this particular napi_poll call is user busy polling".
>>>>>>>>>>>
>>>>>>>>>>> Then either the application needs to be polling all the time (wasting cpu
>>>>>>>>>>> cycles) or latencies will be determined by the timeout.
>>>>>>>> But if I understand correctly, this means that if the application
>>>>>>>> thread that is supposed
>>>>>>>> to do napi busy polling gets busy doing work on the new data/events in
>>>>>>>> userspace, napi polling
>>>>>>>> will not be done until the suspend_timeout triggers? Do you dispatch
>>>>>>>> work to a separate worker
>>>>>>>> threads, in userspace, from the thread that is doing epoll_wait?
>>>>>>>
>>>>>>> Yes, napi polling is suspended while the application is busy between
>>>>>>> epoll_wait calls. That's where the benefits are coming from.
>>>>>>>
>>>>>>> The consequences depend on the nature of the application and overall
>>>>>>> preferences for the system. If there's a "dominant" application for a
>>>>>>> number of queues and cores, the resulting latency for other background
>>>>>>> applications using the same queues might not be a problem at all.
>>>>>>>
>>>>>>> One other simple mitigation is limiting the number of events that each
>>>>>>> epoll_wait call accepts. Note that this batch size also determines the
>>>>>>> worst-case latency for the application in question, so there is a
>>>>>>> natural incentive to keep it limited.
>>>>>>>
>>>>>>> A more complex application design, like you suggest, might also be an
>>>>>>> option.
>>>>>>>
>>>>>>>>>>> Only when switching back and forth between polling and interrupts is it
>>>>>>>>>>> possible to get low latencies across a large spectrum of offered loads
>>>>>>>>>>> without burning cpu cycles at 100%.
>>>>>>>>>>
>>>>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment
>>>>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs.
>>>>>>>>>
>>>>>>>>> Thanks for probing and double-checking everything! Feedback is important
>>>>>>>>> for us to properly document our proposal.
>>>>>>>>>
>>>>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly
>>>>>>>>>> disabling napi during busy polling is super useful, but it would still
>>>>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on
>>>>>>>>>> a per-napi basis imho.
>>>>>>>> I agree, this would allow each napi queue to tune itself based on
>>>>>>>> heuristics. But I think
>>>>>>>> doing it through epoll independent interface makes more sense as Stan
>>>>>>>> suggested earlier.
>>>>>>>
>>>>>>> The question is whether to add a useful mechanism (one sysfs parameter
>>>>>>> and a few lines of code) that is optional, but with demonstrable and
>>>>>>> significant performance/efficiency improvements for an important class
>>>>>>> of applications - or wait for an uncertain future?
>>>>>>
>>>>>> The issue is that this one little change can never be removed, as it
>>>>>> becomes ABI.
>>>>>>
>>>>>> Let's get the right API from the start.
>>>>>>
>>>>>> Not sure that a global variable, or sysfs as API, is the right one.
>>>>>
>>>>> Sorry per-device, not global.
>>>>>
>>>>> My main concern is that it adds yet another user tunable integer, for
>>>>> which the right value is not obvious.
>>>>
>>>> This is a feature for advanced users just like SO_INCOMING_NAPI_ID
>>>> and countless other features.
>>>>
>>>> The value may not be obvious, but guidance (in the form of
>>>> documentation) can be provided.
>>>
>>> Okay. Could you share a stab at what that would look like?
>>
>> The timeout needs to be large enough that an application can get a
>> meaningful number of incoming requests processed without softirq
>> interference. At the same time, the timeout value determines the
>> worst-case delivery delay that a concurrent application using the same
>> queue(s) might experience. Please also see my response to Samiullah
>> quoted above. The specific circumstances and trade-offs might vary,
>> that's why a simple constant likely won't do.
> 
> Thanks. I really do mean this as an exercise of what documentation in
> Documentation/networking/napi.rst will look like. That helps make the
> case that the interface is reasonably easy to use (even if only
> targeting advanced users).
> 
> How does a user measure how much time a process will spend on
> processing a meaningful number of incoming requests, for instance.
> In practice, probably just a hunch?

As an example, we measure around 1M QPS in our experiments, fully 
utilizing 8 cores and knowing that memcached is quite scalable. Thus we 
can conclude a single request takes about 8 us processing time on 
average. That has led us to a 20 us small timeout (gro_flush_timeout), 
enough to make sure that a single request is likely not interfered with, 
but otherwise as small as possible. If multiple requests arrive, the 
system will quickly switch back to polling mode.

At the other end, we have picked a very large irq_suspend_timeout of 
20,000 us to demonstrate that it does not negatively impact latency. 
This would cover 2,500 requests, which is likely excessive, but was 
chosen for demonstration purposes. One can easily measure the 
distribution of epoll_wait batch sizes and batch sizes as low as 64 are 
already very efficient, even in high-load situations.

Also see next paragraph.

> Playing devil's advocate some more: given that ethtool usecs have to
> be chosen with a similar trade-off between latency and efficiency,
> could a multiplicative factor of this (or gro_flush_timeout, same
> thing) be sufficient and easier to choose? The documentation does
> state that the value chosen must be >= gro_flush_timeout.

I believe this would take away flexibility without gaining much. You'd 
still want some sort of admin-controlled 'enable' flag, so you'd still 
need some kind of parameter.

When using our scheme, the factor between gro_flush_timeout and 
irq_suspend_timeout should *roughly* correspond to the maximum batch 
size that an application would process in one go (orders of magnitude, 
see above). This determines both the target application's worst-case 
latency as well as the worst-case latency of concurrent applications, if 
any, as mentioned previously. I believe the optimal factor will vary 
between different scenarios.
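
To tie the numbers together, a quick back-of-the-envelope sketch (using
the same illustrative figures as above, not a recommendation):

/* Rough sizing arithmetic using the example numbers from this thread. */
#include <stdio.h>

int main(void)
{
	double qps = 1e6;			/* measured load             */
	double cores = 8;			/* cores fully utilized      */
	double per_req_us = cores / qps * 1e6;	/* ~8 us per request         */

	double gro_flush_us = 20;		/* small deferral timeout    */
	double irq_suspend_us = 20000;		/* large suspend timeout     */

	printf("per-request service time: ~%.0f us\n", per_req_us);
	printf("requests covered by one suspend period: ~%.0f\n",
	       irq_suspend_us / per_req_us);	/* ~2500 */
	printf("suspend/flush factor (~max batch size): ~%.0f\n",
	       irq_suspend_us / gro_flush_us);	/* ~1000, same order */
	return 0;
}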

>>>>> If the only goal is to safely reenable interrupts when the application
>>>>> stops calling epoll_wait, does this have to be user tunable?
>>>>>
>>>>> Can it be either a single good enough constant, or derived from
>>>>> another tunable, like busypoll_read.
>>>>
>>>> I believe you meant busy_read here, is that right?
>>>>
>>>> At any rate:
>>>>
>>>>     - I don't think a single constant is appropriate, just as it
>>>>       wasn't appropriate for the existing mechanism
>>>>       (napi_defer_hard_irqs/gro_flush_timeout), and
>>>>
>>>>     - Deriving the value from a pre-existing parameter to preserve the
>>>>       ABI, like busy_read, makes using this more confusing for users
>>>>       and complicates the API significantly.
>>>>
>>>> I agree we should get the API right from the start; that's why we've
>>>> submitted this as an RFC ;)
>>>>
>>>> We are happy to take suggestions from the community, but, IMHO,
>>>> re-using an existing parameter for a different purpose only in
>>>> certain circumstances (if I understand your suggestions) is a much
>>>> worse choice than adding a new tunable that clearly states its
>>>> intended singular purpose.
>>>
>>> Ack. I was thinking whether an epoll flag through your new epoll
>>> ioctl interface to toggle the IRQ suspension (and timer start)
>>> would be preferable. Because more fine grained.
>>
>> A value provided by an application through the epoll ioctl would not be
>> subject to admin oversight, so a misbehaving application could set an
>> arbitrary timeout value. A sysfs value needs to be set by an admin. The
>> ideal timeout value depends both on the particular target application as
>> well as concurrent applications using the same queue(s) - as sketched above.
> 
> I meant setting the value systemwide (or per-device), but opting in to
> > the feature via a binary epoll option. Really an epoll_wait flag, if we
> had flags.
> 
> Any admin privileged operations can also be protected at the epoll
> level by requiring CAP_NET_ADMIN too, of course. But fair point that
> this might operate in a multi-process environment, so values should
> not be hardcoded into the binaries.
> 
> Just asking questions to explore the option space so as not to settle
> on an API too soon. Given that, as said, we cannot remove it later.

I agree, but I believe we are converging? Also taking into account Joe's 
earlier response, given that the suspend mechanism dovetails so nicely 
with gro_flush_timeout and napi_defer_hard_irqs, it just seems natural 
to put irq_suspend_timeout at the same level and I haven't seen any 
strong reason to put it elsewhere.

>>> Also, the value is likely dependent more on the expected duration
>>> of userspace processing? If so, it would be the same for all
>>> devices, so does a per-netdev value make sense?
>>
>> It is per-netdev in the current proposal to be at the same granularity
>> as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension
>> operates at the same level/granularity. This allows for more control
>> than a global setting and it can be migrated to per-napi settings along
>> with gro_flush_timeout and napi_defer_hard_irqs when the time comes.
> 
> Ack, makes sense. Many of these design choices and their rationale are
> good to explicitly capture in the commit message.

Agreed.

Thanks,
Martin
Willem de Bruijn Aug. 18, 2024, 12:55 p.m. UTC | #25
> >>>> The value may not be obvious, but guidance (in the form of
> >>>> documentation) can be provided.
> >>>
> >>> Okay. Could you share a stab at what that would look like?
> >>
> >> The timeout needs to be large enough that an application can get a
> >> meaningful number of incoming requests processed without softirq
> >> interference. At the same time, the timeout value determines the
> >> worst-case delivery delay that a concurrent application using the same
> >> queue(s) might experience. Please also see my response to Samiullah
> >> quoted above. The specific circumstances and trade-offs might vary,
> >> that's why a simple constant likely won't do.
> > 
> > Thanks. I really do mean this as an exercise of what documentation in
> > Documentation/networking/napi.rst will look like. That helps make the
> > case that the interface is reasonably easy to use (even if only
> > targeting advanced users).
> > 
> > How does a user measure how much time a process will spend on
> > processing a meaningful number of incoming requests, for instance.
> > In practice, probably just a hunch?
> 
> As an example, we measure around 1M QPS in our experiments, fully 
> utilizing 8 cores and knowing that memcached is quite scalable. Thus we 
> can conclude a single request takes about 8 us processing time on 
> average. That has led us to a 20 us small timeout (gro_flush_timeout), 
> enough to make sure that a single request is likely not interfered with, 
> but otherwise as small as possible. If multiple requests arrive, the 
> system will quickly switch back to polling mode.
> 
> At the other end, we have picked a very large irq_suspend_timeout of 
> 20,000 us to demonstrate that it does not negatively impact latency. 
> This would cover 2,500 requests, which is likely excessive, but was 
> chosen for demonstration purposes. One can easily measure the 
> distribution of epoll_wait batch sizes and batch sizes as low as 64 are 
> already very efficient, even in high-load situations.

Overall Ack on both your and Joe's responses.

epoll_wait disables the suspend if no events are found and ep_poll
would go to sleep. As the paper also hints, the timeout is only there
for misbehaving applications that stop calling epoll_wait, correct?
If so, then picking a value is not that critical, as long as not too
low to do meaningful work.

> Also see next paragraph.
> 
> > Playing devil's advocate some more: given that ethtool usecs have to
> > be chosen with a similar trade-off between latency and efficiency,
> > could a multiplicative factor of this (or gro_flush_timeout, same
> > thing) be sufficient and easier to choose? The documentation does
> > state that the value chosen must be >= gro_flush_timeout.
> 
> I believe this would take away flexibility without gaining much. You'd 
> still want some sort of admin-controlled 'enable' flag, so you'd still 
> need some kind of parameter.
> 
> When using our scheme, the factor between gro_flush_timeout and 
> irq_suspend_timeout should *roughly* correspond to the maximum batch 
> size that an application would process in one go (orders of magnitude, 
> see above). This determines both the target application's worst-case 
> latency as well as the worst-case latency of concurrent applications, if 
> any, as mentioned previously.

Oh is concurrent applications the argument against a very high
timeout?

> I believe the optimal factor will vary 
> between different scenarios.
> 
> >>>>> If the only goal is to safely reenable interrupts when the application
> >>>>> stops calling epoll_wait, does this have to be user tunable?
> >>>>>
> >>>>> Can it be either a single good enough constant, or derived from
> >>>>> another tunable, like busypoll_read.
> >>>>
> >>>> I believe you meant busy_read here, is that right?
> >>>>
> >>>> At any rate:
> >>>>
> >>>>     - I don't think a single constant is appropriate, just as it
> >>>>       wasn't appropriate for the existing mechanism
> >>>>       (napi_defer_hard_irqs/gro_flush_timeout), and
> >>>>
> >>>>     - Deriving the value from a pre-existing parameter to preserve the
> >>>>       ABI, like busy_read, makes using this more confusing for users
> >>>>       and complicates the API significantly.
> >>>>
> >>>> I agree we should get the API right from the start; that's why we've
> >>>> submitted this as an RFC ;)
> >>>>
> >>>> We are happy to take suggestions from the community, but, IMHO,
> >>>> re-using an existing parameter for a different purpose only in
> >>>> certain circumstances (if I understand your suggestions) is a much
> >>>> worse choice than adding a new tunable that clearly states its
> >>>> intended singular purpose.
> >>>
> >>> Ack. I was thinking whether an epoll flag through your new epoll
> >>> ioctl interface to toggle the IRQ suspension (and timer start)
> >>> would be preferable. Because more fine grained.
> >>
> >> A value provided by an application through the epoll ioctl would not be
> >> subject to admin oversight, so a misbehaving application could set an
> >> arbitrary timeout value. A sysfs value needs to be set by an admin. The
> >> ideal timeout value depends both on the particular target application as
> >> well as concurrent applications using the same queue(s) - as sketched above.
> > 
> > I meant setting the value systemwide (or per-device), but opting in to
> > the feature via a binary epoll option. Really an epoll_wait flag, if we
> > had flags.
> > 
> > Any admin privileged operations can also be protected at the epoll
> > level by requiring CAP_NET_ADMIN too, of course. But fair point that
> > this might operate in a multi-process environment, so values should
> > not be hardcoded into the binaries.
> > 
> > Just asking questions to explore the option space so as not to settle
> > on an API too soon. Given that, as said, we cannot remove it later.
> 
> I agree, but I believe we are converging? Also taking into account Joe's 
> earlier response, given that the suspend mechanism dovetails so nicely 
> with gro_flush_timeout and napi_defer_hard_irqs, it just seems natural 
> to put irq_suspend_timeout at the same level and I haven't seen any 
> strong reason to put it elsewhere.

Yes, this sounds good.
 
> >>> Also, the value is likely dependent more on the expected duration
> >>> of userspace processing? If so, it would be the same for all
> >>> devices, so does a per-netdev value make sense?
> >>
> >> It is per-netdev in the current proposal to be at the same granularity
> >> as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension
> >> operates at the same level/granularity. This allows for more control
> >> than a global setting and it can be migrated to per-napi settings along
> >> with gro_flush_timeout and napi_defer_hard_irqs when the time comes.
> > 
> > Ack, makes sense. Many of these design choices and their rationale are
> > good to explicitly capture in the commit message.
> 
> Agreed.
> 
> Thanks,
> Martin
Martin Karsten Aug. 18, 2024, 2:51 p.m. UTC | #26
On 2024-08-18 08:55, Willem de Bruijn wrote:
>>>>>> The value may not be obvious, but guidance (in the form of
>>>>>> documentation) can be provided.
>>>>>
>>>>> Okay. Could you share a stab at what that would look like?
>>>>
>>>> The timeout needs to be large enough that an application can get a
>>>> meaningful number of incoming requests processed without softirq
>>>> interference. At the same time, the timeout value determines the
>>>> worst-case delivery delay that a concurrent application using the same
>>>> queue(s) might experience. Please also see my response to Samiullah
>>>> quoted above. The specific circumstances and trade-offs might vary,
>>>> that's why a simple constant likely won't do.
>>>
>>> Thanks. I really do mean this as an exercise of what documentation in
>>> Documentation/networking/napi.rst will look like. That helps make the
>>> case that the interface is reasonably easy to use (even if only
>>> targeting advanced users).
>>>
>>> How does a user measure how much time a process will spend on
>>> processing a meaningful number of incoming requests, for instance.
>>> In practice, probably just a hunch?
>>
>> As an example, we measure around 1M QPS in our experiments, fully
>> utilizing 8 cores and knowing that memcached is quite scalable. Thus we
>> can conclude a single request takes about 8 us processing time on
>> average. That has led us to a 20 us small timeout (gro_flush_timeout),
>> enough to make sure that a single request is likely not interfered with,
>> but otherwise as small as possible. If multiple requests arrive, the
>> system will quickly switch back to polling mode.
>>
>> At the other end, we have picked a very large irq_suspend_timeout of
>> 20,000 us to demonstrate that it does not negatively impact latency.
>> This would cover 2,500 requests, which is likely excessive, but was
>> chosen for demonstration purposes. One can easily measure the
>> distribution of epoll_wait batch sizes and batch sizes as low as 64 are
>> already very efficient, even in high-load situations.
> 
> Overall Ack on both your and Joe's responses.
> 
> epoll_wait disables the suspend if no events are found and ep_poll
> would go to sleep. As the paper also hints, the timeout is only there
> for misbehaving applications that stop calling epoll_wait, correct?
> If so, then picking a value is not that critical, as long as not too
> low to do meaningful work.

Correct.

>> Also see next paragraph.
>>
>>> Playing devil's advocate some more: given that ethtool usecs have to
>>> be chosen with a similar trade-off between latency and efficiency,
>>> could a multiplicative factor of this (or gro_flush_timeout, same
>>> thing) be sufficient and easier to choose? The documentation does
>>> state that the value chosen must be >= gro_flush_timeout.
>>
>> I believe this would take away flexibility without gaining much. You'd
>> still want some sort of admin-controlled 'enable' flag, so you'd still
>> need some kind of parameter.
>>
>> When using our scheme, the factor between gro_flush_timeout and
>> irq_suspend_timeout should *roughly* correspond to the maximum batch
>> size that an application would process in one go (orders of magnitude,
>> see above). This determines both the target application's worst-case
>> latency as well as the worst-case latency of concurrent applications, if
>> any, as mentioned previously.
> 
> Oh is concurrent applications the argument against a very high
> timeout?

Only in the error case. If irq_suspend_timeout is large enough as you 
point out above, then as long as the target application behaves well, 
its batching settings are the determining factor.

Thanks,
Martin
Jakub Kicinski Aug. 20, 2024, 2:07 a.m. UTC | #27
On Tue, 13 Aug 2024 21:14:40 -0400 Martin Karsten wrote:
> > What about NIC interrupt coalescing. defer_hard_irqs_count was supposed
> > to be used with NICs which either don't have IRQ coalescing or have a
> > broken implementation. The timeout of 200usec should be perfectly within
> > range of what NICs can support.
> > 
> > If the NIC IRQ coalescing works, instead of adding a new timeout value
> > we could add a new deferral control (replacing defer_hard_irqs_count)
> > which would always kick in after seeing prefer_busy_poll() but also
> > not kick in if the busy poll harvested 0 packets.  
> Maybe I am missing something, but I believe this would have the same 
> problem that we describe for gro-timeout + defer-irq. When busy poll 
> does not harvest packets and the application thread is idle and goes to 
> sleep, it would then take up to 200 us to get the next interrupt. This 
> considerably increases tail latencies under low load.
> 
> In order to get low latencies under low load, the NIC timeout would have to
> be something like 20 us, but under high load the application thread will 
> be busy for longer than 20 us and the interrupt (and softirq) will come 
> too early and cause interference.

An FSM-like diagram would go a long way in clarifying things :)

> It is tempting to think of the second timeout as 0 and in fact re-enable 
> interrupts right away. We have tried it, but it leads to a lot of 
> interrupts and corresponding inefficiencies, since a system below 
> capacity frequently switches between busy and idle. Using a small 
> timeout (20 us) for modest deferral and batching when idle is a lot more 
> efficient.

I see. I think we are on the same page. What I was suggesting is to use
the HW timer instead of the short timer. But I suspect the NIC you're
using isn't really good at clearing IRQs before unmasking. Meaning that
when you try to reactivate HW control there's already an IRQ pending
and it fires pointlessly. That matches my experience with mlx5. 
If the NIC driver was to clear the IRQ state before running the NAPI
loop, we would have no pending IRQ by the time we unmask and activate
HW IRQs.

Sorry for the delay.
Jakub Kicinski Aug. 20, 2024, 2:36 a.m. UTC | #28
On Sun, 18 Aug 2024 10:51:04 -0400 Martin Karsten wrote:
> >> I believe this would take away flexibility without gaining much. You'd
> >> still want some sort of admin-controlled 'enable' flag, so you'd still
> >> need some kind of parameter.
> >>
> >> When using our scheme, the factor between gro_flush_timeout and
> >> irq_suspend_timeout should *roughly* correspond to the maximum batch
> >> size that an application would process in one go (orders of magnitude,
> >> see above). This determines both the target application's worst-case
> >> latency as well as the worst-case latency of concurrent applications, if
> >> any, as mentioned previously.  
> > 
> > Oh is concurrent applications the argument against a very high
> > timeout?  
> 
> Only in the error case. If irq_suspend_timeout is large enough, as you
> point out above, then as long as the target application behaves well,
> its batching settings are the determining factor.

Since the discussion is still sort of going on, let me ask something
potentially stupid (I haven't read the paper yet). Are the cores
assumed to be fully isolated (ergo the application can only yield 
to the idle thread)? Do we not have to worry about the scheduler
deciding to schedule the process out involuntarily?
Martin Karsten Aug. 20, 2024, 2:27 p.m. UTC | #29
On 2024-08-19 22:07, Jakub Kicinski wrote:
> On Tue, 13 Aug 2024 21:14:40 -0400 Martin Karsten wrote:
>>> What about NIC interrupt coalescing. defer_hard_irqs_count was supposed
>>> to be used with NICs which either don't have IRQ coalescing or have a
>>> broken implementation. The timeout of 200usec should be perfectly within
>>> range of what NICs can support.
>>>
>>> If the NIC IRQ coalescing works, instead of adding a new timeout value
>>> we could add a new deferral control (replacing defer_hard_irqs_count)
>>> which would always kick in after seeing prefer_busy_poll() but also
>>> not kick in if the busy poll harvested 0 packets.
>> Maybe I am missing something, but I believe this would have the same
>> problem that we describe for gro-timeout + defer-irq. When busy poll
>> does not harvest packets and the application thread is idle and goes to
>> sleep, it would then take up to 200 us to get the next interrupt. This
>> considerably increases tail latencies under low load.
>>
>> In order to get low latencies under low load, the NIC timeout would have to
>> be something like 20 us, but under high load the application thread will
>> be busy for longer than 20 us and the interrupt (and softirq) will come
>> too early and cause interference.
> 
> An FSM-like diagram would go a long way in clarifying things :)

I agree the suspend mechanism is not trivial and the implementation is 
subtle. It has frequently made our heads hurt while developing this. We 
will take a long hard look at our cover letter and produce other 
documentation to hopefully provide clear explanations.

>> It is tempting to think of the second timeout as 0 and in fact re-enable
>> interrupts right away. We have tried it, but it leads to a lot of
>> interrupts and corresponding inefficiencies, since a system below
>> capacity frequently switches between busy and idle. Using a small
>> timeout (20 us) for modest deferral and batching when idle is a lot more
>> efficient.
> 
> I see. I think we are on the same page. What I was suggesting is to use
> the HW timer instead of the short timer. But I suspect the NIC you're
> using isn't really good at clearing IRQs before unmasking, meaning that
> when you try to reactivate HW control there is already an IRQ pending
> and it fires pointlessly. That matches my experience with mlx5.
> If the NIC driver were to clear the IRQ state before running the NAPI
> loop, we would have no pending IRQ by the time we unmask and activate
> HW IRQs.

I believe there are additional issues. The problem is that the long 
timeout must engage if and only if prefer-busy is active.

When using NIC coalescing for the short timeout (without gro/defer), an 
interrupt after an idle period will trigger softirq, which will run napi 
polling. At this point, prefer-busy is not active, so NIC interrupts 
would be re-enabled. Then it is not possible for the longer timeout to 
interject to switch control back to polling. In other words, only by 
using the software timer for the short timeout is it possible to extend 
the timeout without having to reprogram the NIC timer or reach down 
directly and disable interrupts.

Using gro_flush_timeout for the long timeout also has problems, for the 
same underlying reason. In the current napi implementation, 
gro_flush_timeout is not tied to prefer-busy. We'd either have to change 
that and in the process modify the existing deferral mechanism, or 
introduce a state variable to determine whether gro_flush_timeout is 
used as the long timeout for irq suspend or whether it is used for its 
default purpose. In an earlier version, we did try something similar to 
the latter and made it work, but it ended up being a lot more convoluted 
than our current proposal.
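
Roughly, and purely for illustration (the state and event names below
are made up, not kernel symbols), the control states discussed in this
thread look something like:

/* Sketch of per-NAPI interrupt control states; approximate only. */
enum napi_irq_state {
	IRQ_ENABLED,	/* HW interrupts on, no deferral active            */
	IRQ_DEFERRED,	/* short software timer (gro_flush_timeout) armed  */
	IRQ_SUSPENDED,	/* long timer (irq_suspend_timeout) armed while    */
			/* prefer-busy epoll_wait keeps harvesting events  */
};

/* Approximate transitions:
 *  IRQ_ENABLED   -> IRQ_DEFERRED  : interrupt fires, napi poll runs,
 *                                   deferral timers are (re)armed
 *  IRQ_DEFERRED  -> IRQ_SUSPENDED : prefer-busy epoll_wait returns events
 *  IRQ_SUSPENDED -> IRQ_SUSPENDED : further epoll_wait calls keep finding
 *                                   events, renewing the suspension
 *  IRQ_SUSPENDED -> IRQ_DEFERRED  : busy poll harvests nothing, or
 *                                   irq_suspend_timeout expires
 *  IRQ_DEFERRED  -> IRQ_ENABLED   : gro_flush_timeout expires with no
 *                                   further work
 */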

Thanks,
Martin
Martin Karsten Aug. 20, 2024, 2:28 p.m. UTC | #30
On 2024-08-19 22:36, Jakub Kicinski wrote:
> On Sun, 18 Aug 2024 10:51:04 -0400 Martin Karsten wrote:
>>>> I believe this would take away flexibility without gaining much. You'd
>>>> still want some sort of admin-controlled 'enable' flag, so you'd still
>>>> need some kind of parameter.
>>>>
>>>> When using our scheme, the factor between gro_flush_timeout and
>>>> irq_suspend_timeout should *roughly* correspond to the maximum batch
>>>> size that an application would process in one go (orders of magnitude,
>>>> see above). This determines both the target application's worst-case
>>>> latency as well as the worst-case latency of concurrent applications, if
>>>> any, as mentioned previously.
>>>
>>> Oh is concurrent applications the argument against a very high
>>> timeout?
>>
>> Only in the error case. If irq_suspend_timeout is large enough, as you
>> point out above, then as long as the target application behaves well,
>> its batching settings are the determining factor.
> 
> Since the discussion is still sort of going on, let me ask something
> potentially stupid (I haven't read the paper yet). Are the cores
> assumed to be fully isolated (ergo the application can only yield
> to the idle thread)? Do we not have to worry about the scheduler
> deciding to schedule the process out involuntarily?

That shouldn't be a problem. If the next thread(s) can make progress, 
nothing is lost. If the next thread(s) cannot make progress, for example 
waiting for network I/O, they will block and the target application 
thread will run again. If another thread is busy-looping on network I/O, 
I would argue that having multiple busy-looping threads competing for 
the same core is probably not a good idea anyway.

Thanks,
Martin