Message ID: 20240812125717.413108-1-jdamato@fastly.com
Series: Suspend IRQs during preferred busy poll
On 08/12, Joe Damato wrote: > Greetings: > > Martin Karsten (CC'd) and I have been collaborating on some ideas about > ways of reducing tail latency when using epoll-based busy poll and we'd > love to get feedback from the list on the code in this series. This is > the idea I mentioned at netdev conf, for those who were there. Barring > any major issues, we hope to submit this officially shortly after RFC. > > The basic idea for suspending IRQs in this manner was described in an > earlier paper presented at Sigmetrics 2024 [1]. Let me explicitly call out the paper. Very nice analysis! > Previously, commit 18e2bf0edf4d ("eventpoll: Add epoll ioctl for > epoll_params") introduced the ability to enable or disable preferred > busy poll mode on a specific epoll context using an ioctl > (EPIOCSPARAMS). > > This series extends preferred busy poll mode by adding a sysfs parameter, > irq_suspend_timeout, which when used in combination with preferred busy > poll suspends device IRQs up to irq_suspend_timeout nanoseconds. > > Important call outs: > - Enabling per epoll-context preferred busy poll will now effectively > lead to a nonblocking iteration through napi_busy_loop, even when > busy_poll_usecs is 0. See patch 4. > > - Patches apply cleanly on net-next commit c4e82c025b3f ("net: dsa: > microchip: ksz9477: split half-duplex monitoring function"), but > may need to be respun if/when commit b4988e3bd1f0 ("eventpoll: Annotate > data-race of busy_poll_usecs") picked up by the vfs folks makes its way > into net-next. > > - In the future, time permitting, I hope to enable support for > napi_defer_hard_irqs, gro_flush_timeout (introduced in commit > 6f8b12d661d0 ("net: napi: add hard irqs deferral feature")), and > irq_suspend_timeout (introduced in this series) on a per-NAPI basis > (presumably via netdev-genl). > > ~ Description of the changes > > The overall idea is that IRQ suspension is introduced via a sysfs > parameter which controls the maximum time that IRQs can be suspended. > > Here's how it is intended to work: > - An administrator sets the existing sysfs parameters for > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > > - An administrator sets the new sysfs parameter irq_suspend_timeout > to a larger value than gro-timeout to enable IRQ suspension. Can you expand more on what's the problem with the existing gro_flush_timeout? Is it defer_hard_irqs_count? Or you want a separate timeout only for the perfer_busy_poll case(why?)? Because looking at the first two patches, you essentially replace all usages of gro_flush_timeout with a new variable and I don't see how it helps. Maybe expand more on what code paths are we trying to improve? Existing busy polling code is not super readable, so would be nice to simplify it a bit in the process (if possible) instead of adding one more tunable. > - The user application issues the existing epoll ioctl to set the > prefer_busy_poll flag on the epoll context. > > - The user application then calls epoll_wait to busy poll for network > events, as it normally would. > > - If epoll_wait returns events to userland, IRQ are suspended for the > duration of irq_suspend_timeout. > > - If epoll_wait finds no events and the thread is about to go to > sleep, IRQ handling using gro_flush_timeout and defer_hard_irqs is > resumed. > > As long as epoll_wait is retrieving events, IRQs (and softirq > processing) for the NAPI being polled remain disabled. 
> Unless IRQ suspension is continued by subsequent calls to epoll_wait, it > automatically times out after the irq_suspend_timeout timer expires. > > When network traffic reduces, eventually a busy poll loop in the kernel > will retrieve no data. When this occurs, regular deferral using > gro_flush_timeout for the polled NAPI is immediately re-enabled. Regular > deferral is also immediately re-enabled when the epoll context is > destroyed. > > ~ Benchmark configs & descriptions > > These changes were benchmarked with memcached [2] using the > benchmarking tool mutilate [3]. > > To facilitate benchmarking, a small patch [4] was applied to > memcached 1.6.29 (the latest memcached release as of this RFC) to allow > setting per-epoll context preferred busy poll and other settings > via environment variables. > > Multiple scenarios were benchmarked as described below > and the scripts used for producing these results can be found on > github [5]. > > (note: all scenarios use NAPI-based traffic splitting via SO_INCOMING_NAPI_ID > by passing -N to memcached): > > - base: Other than NAPI-based traffic splitting, no other options are > enabled. > - busy: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to 200,000 > - enable busy poll via the existing ioctl (busy_poll_usecs = 64, > busy_poll_budget = 64, prefer_busy_poll = true) > - deferX: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to X,000 [..] > - suspendX: > - set defer_hard_irqs to 100 > - set gro_flush_timeout to X,000 > - set irq_suspend_timeout to 20,000,000 > - enable busy poll via the existing ioctl (busy_poll_usecs = 0, > busy_poll_budget = 64, prefer_busy_poll = true) What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback to busy_poll sysctl value?
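For reference, a minimal userspace sketch of the setup the cover letter describes: enabling preferred busy poll on an epoll context via EPIOCSPARAMS and then busy polling with epoll_wait. The definitions mirror the uapi added by commit 18e2bf0edf4d (newer libcs expose them via <sys/epoll.h>, older ones need the local copies below); the sysfs values in the comment correspond to the suspendX scenario, with "eth0" as a placeholder device, and error handling is omitted.

#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* mirrors include/uapi/linux/eventpoll.h from commit 18e2bf0edf4d;
 * double-check the layout/ioctl number against your kernel headers */
struct epoll_params {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t  prefer_busy_poll;
	uint8_t  __pad;		/* must be zero */
};
#define EPOLL_IOC_TYPE 0x8A
#define EPIOCSPARAMS _IOW(EPOLL_IOC_TYPE, 0x01, struct epoll_params)
#endif

int main(void)
{
	/* Administrator side, per device (values from the suspendX scenario):
	 *   /sys/class/net/eth0/napi_defer_hard_irqs = 100
	 *   /sys/class/net/eth0/gro_flush_timeout    = 20000      (20 us)
	 *   /sys/class/net/eth0/irq_suspend_timeout  = 20000000   (20 ms, new in this series)
	 */
	int epfd = epoll_create1(0);

	struct epoll_params params;
	memset(&params, 0, sizeof(params));
	params.busy_poll_usecs  = 0;	/* one nonblocking pass is enough, see patch 4 */
	params.busy_poll_budget = 64;
	params.prefer_busy_poll = 1;
	ioctl(epfd, EPIOCSPARAMS, &params);

	/* ... epoll_ctl(epfd, EPOLL_CTL_ADD, ...) sockets from one NAPI/queue ... */

	struct epoll_event events[64];
	for (;;) {
		/* While this keeps returning events, device IRQs stay suspended
		 * (renewed up to irq_suspend_timeout each time); once it finds
		 * nothing and blocks, regular gro_flush_timeout/defer_hard_irqs
		 * deferral takes over again. */
		int n = epoll_wait(epfd, events, 64, -1);
		(void)n;	/* ... process the n events ... */
	}
}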
On 2024-08-12 16:19, Stanislav Fomichev wrote: > On 08/12, Joe Damato wrote: >> Greetings: >> >> Martin Karsten (CC'd) and I have been collaborating on some ideas about >> ways of reducing tail latency when using epoll-based busy poll and we'd >> love to get feedback from the list on the code in this series. This is >> the idea I mentioned at netdev conf, for those who were there. Barring >> any major issues, we hope to submit this officially shortly after RFC. >> >> The basic idea for suspending IRQs in this manner was described in an >> earlier paper presented at Sigmetrics 2024 [1]. > > Let me explicitly call out the paper. Very nice analysis! Thank you! [snip] >> Here's how it is intended to work: >> - An administrator sets the existing sysfs parameters for >> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. >> >> - An administrator sets the new sysfs parameter irq_suspend_timeout >> to a larger value than gro-timeout to enable IRQ suspension. > > Can you expand more on what's the problem with the existing gro_flush_timeout? > Is it defer_hard_irqs_count? Or you want a separate timeout only for the > perfer_busy_poll case(why?)? Because looking at the first two patches, > you essentially replace all usages of gro_flush_timeout with a new variable > and I don't see how it helps. gro-flush-timeout (in combination with defer-hard-irqs) is the default irq deferral mechanism and as such, always active when configured. Its static periodic softirq processing leads to a situation where: - A long gro-flush-timeout causes high latencies when load is sufficiently below capacity, or - a short gro-flush-timeout causes overhead when softirq execution asynchronously competes with application processing at high load. The shortcomings of this are documented (to some extent) by our experiments. See defer20 working well at low load, but having problems at high load, while defer200 having higher latency at low load. irq-suspend-timeout is only active when an application uses prefer-busy-polling and in that case, produces a nice alternating pattern of application processing and networking processing (similar to what we describe in the paper). This then works well with both low and high load. > Maybe expand more on what code paths are we trying to improve? Existing > busy polling code is not super readable, so would be nice to simplify > it a bit in the process (if possible) instead of adding one more tunable. There are essentially three possible loops for network processing: 1) hardirq -> softirq -> napi poll; this is the baseline functionality 2) timer -> softirq -> napi poll; this is deferred irq processing scheme with the shortcomings described above 3) epoll -> busy-poll -> napi poll If a system is configured for 1), not much can be done, as it is difficult to interject anything into this loop without adding state and side effects. This is what we tried for the paper, but it ended up being a hack. If however the system is configured for irq deferral, Loops 2) and 3) "wrestle" with each other for control. Injecting the larger irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour of Loop 3) and creates the nice pattern describe above. [snip] >> - suspendX: >> - set defer_hard_irqs to 100 >> - set gro_flush_timeout to X,000 >> - set irq_suspend_timeout to 20,000,000 >> - enable busy poll via the existing ioctl (busy_poll_usecs = 0, >> busy_poll_budget = 64, prefer_busy_poll = true) > > What's the intention of `busy_poll_usecs = 0` here? 
> Presumably we fallback to busy_poll sysctl value? Before this patch set, ep_poll only calls napi_busy_poll if busy_poll (sysctl) or busy_poll_usecs is nonzero. However, this might lead to busy-polling even when the application does not actually need or want it. Only one iteration through the busy loop is needed to make the new scheme work. Additional napi busy polling over and above that is optional. Thanks, Martin
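To make the "only one iteration is needed" point concrete, here is a rough sketch of the ep_poll() decision as described above. This is not the actual diff from patch 4; the helper name is made up, and the eventpoll fields are the per-context ones added by the EPIOCSPARAMS work.

static bool ep_should_busy_loop(struct eventpoll *ep)	/* made-up helper name */
{
	/* pre-series behaviour: busy poll only if a busy-poll time is set */
	if (READ_ONCE(ep->busy_poll_usecs) || net_busy_loop_on())
		return true;

	/* with patch 4: prefer_busy_poll alone buys a single nonblocking
	 * pass through napi_busy_loop(), which is all the IRQ-suspension
	 * scheme needs; any extra busy looping stays optional */
	return READ_ONCE(ep->prefer_busy_poll);
}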
On 08/12, Martin Karsten wrote: > On 2024-08-12 16:19, Stanislav Fomichev wrote: > > On 08/12, Joe Damato wrote: > > > Greetings: > > > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about > > > ways of reducing tail latency when using epoll-based busy poll and we'd > > > love to get feedback from the list on the code in this series. This is > > > the idea I mentioned at netdev conf, for those who were there. Barring > > > any major issues, we hope to submit this officially shortly after RFC. > > > > > > The basic idea for suspending IRQs in this manner was described in an > > > earlier paper presented at Sigmetrics 2024 [1]. > > > > Let me explicitly call out the paper. Very nice analysis! > > Thank you! > > [snip] > > > > Here's how it is intended to work: > > > - An administrator sets the existing sysfs parameters for > > > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > > > > > > - An administrator sets the new sysfs parameter irq_suspend_timeout > > > to a larger value than gro-timeout to enable IRQ suspension. > > > > Can you expand more on what's the problem with the existing gro_flush_timeout? > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the > > perfer_busy_poll case(why?)? Because looking at the first two patches, > > you essentially replace all usages of gro_flush_timeout with a new variable > > and I don't see how it helps. > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq > deferral mechanism and as such, always active when configured. Its static > periodic softirq processing leads to a situation where: > > - A long gro-flush-timeout causes high latencies when load is sufficiently > below capacity, or > > - a short gro-flush-timeout causes overhead when softirq execution > asynchronously competes with application processing at high load. > > The shortcomings of this are documented (to some extent) by our experiments. > See defer20 working well at low load, but having problems at high load, > while defer200 having higher latency at low load. > > irq-suspend-timeout is only active when an application uses > prefer-busy-polling and in that case, produces a nice alternating pattern of > application processing and networking processing (similar to what we > describe in the paper). This then works well with both low and high load. So you only want it for the prefer-busy-pollingc case, makes sense. I was a bit confused by the difference between defer200 and suspend200, but now I see that defer200 does not enable busypoll. I'm assuming that if you enable busypool in defer200 case, the numbers should be similar to suspend200 (ignoring potentially affecting non-busypolling queues due to higher gro_flush_timeout). > > Maybe expand more on what code paths are we trying to improve? Existing > > busy polling code is not super readable, so would be nice to simplify > > it a bit in the process (if possible) instead of adding one more tunable. > > There are essentially three possible loops for network processing: > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > with the shortcomings described above > > 3) epoll -> busy-poll -> napi poll > > If a system is configured for 1), not much can be done, as it is difficult > to interject anything into this loop without adding state and side effects. > This is what we tried for the paper, but it ended up being a hack. 
> > If however the system is configured for irq deferral, Loops 2) and 3) > "wrestle" with each other for control. Injecting the larger > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour > of Loop 3) and creates the nice pattern describe above. And you hit (2) when the epoll goes to sleep and/or when the userspace isn't fast enough to keep up with the timer, presumably? I wonder if need to use this opportunity and do proper API as Joe hints in the cover letter. Something over netlink to say "I'm gonna busy-poll on this queue / napi_id and with this timeout". And then we can essentially make gro_flush_timeout per queue (and avoid napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels too hacky already :-( > [snip] > > > > - suspendX: > > > - set defer_hard_irqs to 100 > > > - set gro_flush_timeout to X,000 > > > - set irq_suspend_timeout to 20,000,000 > > > - enable busy poll via the existing ioctl (busy_poll_usecs = 0, > > > busy_poll_budget = 64, prefer_busy_poll = true) > > > > What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback > > to busy_poll sysctl value? > > Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll > (sysctl) or busy_poll_usecs is nonzero. However, this might lead to > busy-polling even when the application does not actually need or want it. > Only one iteration through the busy loop is needed to make the new scheme > work. Additional napi busy polling over and above is optional. Ack, thanks, was trying to understand why not stay with busy_poll_usecs=64 for consistency. But I guess you were just trying to show that patch 4/5 works.
On 2024-08-12 19:03, Stanislav Fomichev wrote: > On 08/12, Martin Karsten wrote: >> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>> On 08/12, Joe Damato wrote: >>>> Greetings: >>>> >>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about >>>> ways of reducing tail latency when using epoll-based busy poll and we'd >>>> love to get feedback from the list on the code in this series. This is >>>> the idea I mentioned at netdev conf, for those who were there. Barring >>>> any major issues, we hope to submit this officially shortly after RFC. >>>> >>>> The basic idea for suspending IRQs in this manner was described in an >>>> earlier paper presented at Sigmetrics 2024 [1]. >>> >>> Let me explicitly call out the paper. Very nice analysis! >> >> Thank you! >> >> [snip] >> >>>> Here's how it is intended to work: >>>> - An administrator sets the existing sysfs parameters for >>>> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. >>>> >>>> - An administrator sets the new sysfs parameter irq_suspend_timeout >>>> to a larger value than gro-timeout to enable IRQ suspension. >>> >>> Can you expand more on what's the problem with the existing gro_flush_timeout? >>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the >>> perfer_busy_poll case(why?)? Because looking at the first two patches, >>> you essentially replace all usages of gro_flush_timeout with a new variable >>> and I don't see how it helps. >> >> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq >> deferral mechanism and as such, always active when configured. Its static >> periodic softirq processing leads to a situation where: >> >> - A long gro-flush-timeout causes high latencies when load is sufficiently >> below capacity, or >> >> - a short gro-flush-timeout causes overhead when softirq execution >> asynchronously competes with application processing at high load. >> >> The shortcomings of this are documented (to some extent) by our experiments. >> See defer20 working well at low load, but having problems at high load, >> while defer200 having higher latency at low load. >> >> irq-suspend-timeout is only active when an application uses >> prefer-busy-polling and in that case, produces a nice alternating pattern of >> application processing and networking processing (similar to what we >> describe in the paper). This then works well with both low and high load. > > So you only want it for the prefer-busy-pollingc case, makes sense. I was > a bit confused by the difference between defer200 and suspend200, > but now I see that defer200 does not enable busypoll. > > I'm assuming that if you enable busypool in defer200 case, the numbers > should be similar to suspend200 (ignoring potentially affecting > non-busypolling queues due to higher gro_flush_timeout). defer200 + napi busy poll is essentially what we labelled "busy" and it does not perform as well, since it still suffers interference between application and softirq processing. >>> Maybe expand more on what code paths are we trying to improve? Existing >>> busy polling code is not super readable, so would be nice to simplify >>> it a bit in the process (if possible) instead of adding one more tunable. 
>> >> There are essentially three possible loops for network processing: >> >> 1) hardirq -> softirq -> napi poll; this is the baseline functionality >> >> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme >> with the shortcomings described above >> >> 3) epoll -> busy-poll -> napi poll >> >> If a system is configured for 1), not much can be done, as it is difficult >> to interject anything into this loop without adding state and side effects. >> This is what we tried for the paper, but it ended up being a hack. >> >> If however the system is configured for irq deferral, Loops 2) and 3) >> "wrestle" with each other for control. Injecting the larger >> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour >> of Loop 3) and creates the nice pattern describe above. > > And you hit (2) when the epoll goes to sleep and/or when the userspace > isn't fast enough to keep up with the timer, presumably? I wonder > if need to use this opportunity and do proper API as Joe hints in the > cover letter. Something over netlink to say "I'm gonna busy-poll on > this queue / napi_id and with this timeout". And then we can essentially make > gro_flush_timeout per queue (and avoid > napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels > too hacky already :-( If someone would implement the necessary changes to make these parameters per-napi, this would improve things further, but note that the current proposal gives strong performance across a range of workloads, which is otherwise difficult to impossible to achieve. Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of an individual queue or application to make sure that IRQ suspension is enabled/disabled right away when the state of the system changes from busy to idle and back. >> [snip] >> >>>> - suspendX: >>>> - set defer_hard_irqs to 100 >>>> - set gro_flush_timeout to X,000 >>>> - set irq_suspend_timeout to 20,000,000 >>>> - enable busy poll via the existing ioctl (busy_poll_usecs = 0, >>>> busy_poll_budget = 64, prefer_busy_poll = true) >>> >>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback >>> to busy_poll sysctl value? >> >> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll >> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to >> busy-polling even when the application does not actually need or want it. >> Only one iteration through the busy loop is needed to make the new scheme >> work. Additional napi busy polling over and above is optional. > > Ack, thanks, was trying to understand why not stay with > busy_poll_usecs=64 for consistency. But I guess you were just > trying to show that patch 4/5 works. Right, and we would potentially be wasting CPU cycles by adding more busy-looping. Thanks, Martin
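As a conceptual sketch of where the suspend/resume calls sit in the epoll path, per the cover letter's description: this is not the series' actual code, the wrapper is illustrative, and the function signatures are guesses (the names appear later in this thread).

static int ep_poll_tail(struct eventpoll *ep, int res)	/* illustrative wrapper, not a real function */
{
	if (res > 0) {
		/* events are being handed back to userspace: keep the device
		 * IRQ off for up to irq_suspend_timeout, trusting that the
		 * application returns via epoll_wait() before it expires */
		ep_suspend_napi_irqs(ep);	/* name used later in this thread; signature guessed */
		return res;
	}

	/* nothing found and the thread is about to block: fall back to the
	 * regular defer_hard_irqs/gro_flush_timeout deferral right away */
	ep_resume_napi_irqs(ep);		/* guessed counterpart of napi_resume_irqs */
	/* ... go to sleep and wait for a wakeup/interrupt ... */
	return 0;
}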
On 08/12, Martin Karsten wrote: > On 2024-08-12 19:03, Stanislav Fomichev wrote: > > On 08/12, Martin Karsten wrote: > > > On 2024-08-12 16:19, Stanislav Fomichev wrote: > > > > On 08/12, Joe Damato wrote: > > > > > Greetings: > > > > > > > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about > > > > > ways of reducing tail latency when using epoll-based busy poll and we'd > > > > > love to get feedback from the list on the code in this series. This is > > > > > the idea I mentioned at netdev conf, for those who were there. Barring > > > > > any major issues, we hope to submit this officially shortly after RFC. > > > > > > > > > > The basic idea for suspending IRQs in this manner was described in an > > > > > earlier paper presented at Sigmetrics 2024 [1]. > > > > > > > > Let me explicitly call out the paper. Very nice analysis! > > > > > > Thank you! > > > > > > [snip] > > > > > > > > Here's how it is intended to work: > > > > > - An administrator sets the existing sysfs parameters for > > > > > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > > > > > > > > > > - An administrator sets the new sysfs parameter irq_suspend_timeout > > > > > to a larger value than gro-timeout to enable IRQ suspension. > > > > > > > > Can you expand more on what's the problem with the existing gro_flush_timeout? > > > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the > > > > perfer_busy_poll case(why?)? Because looking at the first two patches, > > > > you essentially replace all usages of gro_flush_timeout with a new variable > > > > and I don't see how it helps. > > > > > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq > > > deferral mechanism and as such, always active when configured. Its static > > > periodic softirq processing leads to a situation where: > > > > > > - A long gro-flush-timeout causes high latencies when load is sufficiently > > > below capacity, or > > > > > > - a short gro-flush-timeout causes overhead when softirq execution > > > asynchronously competes with application processing at high load. > > > > > > The shortcomings of this are documented (to some extent) by our experiments. > > > See defer20 working well at low load, but having problems at high load, > > > while defer200 having higher latency at low load. > > > > > > irq-suspend-timeout is only active when an application uses > > > prefer-busy-polling and in that case, produces a nice alternating pattern of > > > application processing and networking processing (similar to what we > > > describe in the paper). This then works well with both low and high load. > > > > So you only want it for the prefer-busy-pollingc case, makes sense. I was > > a bit confused by the difference between defer200 and suspend200, > > but now I see that defer200 does not enable busypoll. > > > > I'm assuming that if you enable busypool in defer200 case, the numbers > > should be similar to suspend200 (ignoring potentially affecting > > non-busypolling queues due to higher gro_flush_timeout). > > defer200 + napi busy poll is essentially what we labelled "busy" and it does > not perform as well, since it still suffers interference between application > and softirq processing. With all your patches applied? Why? Userspace not keeping up? > > > > Maybe expand more on what code paths are we trying to improve? 
Existing > > > > busy polling code is not super readable, so would be nice to simplify > > > > it a bit in the process (if possible) instead of adding one more tunable. > > > > > > There are essentially three possible loops for network processing: > > > > > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality > > > > > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > > > with the shortcomings described above > > > > > > 3) epoll -> busy-poll -> napi poll > > > > > > If a system is configured for 1), not much can be done, as it is difficult > > > to interject anything into this loop without adding state and side effects. > > > This is what we tried for the paper, but it ended up being a hack. > > > > > > If however the system is configured for irq deferral, Loops 2) and 3) > > > "wrestle" with each other for control. Injecting the larger > > > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour > > > of Loop 3) and creates the nice pattern describe above. > > > > And you hit (2) when the epoll goes to sleep and/or when the userspace > > isn't fast enough to keep up with the timer, presumably? I wonder > > if need to use this opportunity and do proper API as Joe hints in the > > cover letter. Something over netlink to say "I'm gonna busy-poll on > > this queue / napi_id and with this timeout". And then we can essentially make > > gro_flush_timeout per queue (and avoid > > napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels > > too hacky already :-( > > If someone would implement the necessary changes to make these parameters > per-napi, this would improve things further, but note that the current > proposal gives strong performance across a range of workloads, which is > otherwise difficult to impossible to achieve. Let's see what other people have to say. But we tried to do a similar setup at Google recently and getting all these parameters right was not trivial. Joe's recent patch series to push some of these into epoll context are a step in the right direction. It would be nice to have more explicit interface to express busy poling preference for the users vs chasing a bunch of global tunables and fighting against softirq wakups. > Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > an individual queue or application to make sure that IRQ suspension is > enabled/disabled right away when the state of the system changes from busy > to idle and back. Can we not handle everything in napi_busy_loop? If we can mark some napi contexts as "explicitly polled by userspace with a larger defer timeout", we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL which is more like "this particular napi_poll call is user busy polling". > > > [snip] > > > > > > > > - suspendX: > > > > > - set defer_hard_irqs to 100 > > > > > - set gro_flush_timeout to X,000 > > > > > - set irq_suspend_timeout to 20,000,000 > > > > > - enable busy poll via the existing ioctl (busy_poll_usecs = 0, > > > > > busy_poll_budget = 64, prefer_busy_poll = true) > > > > > > > > What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback > > > > to busy_poll sysctl value? > > > > > > Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll > > > (sysctl) or busy_poll_usecs is nonzero. However, this might lead to > > > busy-polling even when the application does not actually need or want it. 
> > > Only one iteration through the busy loop is needed to make the new scheme > > > work. Additional napi busy polling over and above is optional. > > > > Ack, thanks, was trying to understand why not stay with > > busy_poll_usecs=64 for consistency. But I guess you were just > > trying to show that patch 4/5 works. > > Right, and we would potentially be wasting CPU cycles by adding more > busy-looping. Or potentially improving the latency more if you happen to get more packets during busy_poll_usecs duration? I'd imagine some applications might prefer to 100% busy poll without ever going to sleep (that would probably require getting rid of napi_id tracking in epoll, but that's a different story).
On 2024-08-12 21:54, Stanislav Fomichev wrote: > On 08/12, Martin Karsten wrote: >> On 2024-08-12 19:03, Stanislav Fomichev wrote: >>> On 08/12, Martin Karsten wrote: >>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>>>> On 08/12, Joe Damato wrote: >>>>>> Greetings: >>>>>> >>>>>> Martin Karsten (CC'd) and I have been collaborating on some ideas about >>>>>> ways of reducing tail latency when using epoll-based busy poll and we'd >>>>>> love to get feedback from the list on the code in this series. This is >>>>>> the idea I mentioned at netdev conf, for those who were there. Barring >>>>>> any major issues, we hope to submit this officially shortly after RFC. >>>>>> >>>>>> The basic idea for suspending IRQs in this manner was described in an >>>>>> earlier paper presented at Sigmetrics 2024 [1]. >>>>> >>>>> Let me explicitly call out the paper. Very nice analysis! >>>> >>>> Thank you! >>>> >>>> [snip] >>>> >>>>>> Here's how it is intended to work: >>>>>> - An administrator sets the existing sysfs parameters for >>>>>> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. >>>>>> >>>>>> - An administrator sets the new sysfs parameter irq_suspend_timeout >>>>>> to a larger value than gro-timeout to enable IRQ suspension. >>>>> >>>>> Can you expand more on what's the problem with the existing gro_flush_timeout? >>>>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the >>>>> perfer_busy_poll case(why?)? Because looking at the first two patches, >>>>> you essentially replace all usages of gro_flush_timeout with a new variable >>>>> and I don't see how it helps. >>>> >>>> gro-flush-timeout (in combination with defer-hard-irqs) is the default irq >>>> deferral mechanism and as such, always active when configured. Its static >>>> periodic softirq processing leads to a situation where: >>>> >>>> - A long gro-flush-timeout causes high latencies when load is sufficiently >>>> below capacity, or >>>> >>>> - a short gro-flush-timeout causes overhead when softirq execution >>>> asynchronously competes with application processing at high load. >>>> >>>> The shortcomings of this are documented (to some extent) by our experiments. >>>> See defer20 working well at low load, but having problems at high load, >>>> while defer200 having higher latency at low load. >>>> >>>> irq-suspend-timeout is only active when an application uses >>>> prefer-busy-polling and in that case, produces a nice alternating pattern of >>>> application processing and networking processing (similar to what we >>>> describe in the paper). This then works well with both low and high load. >>> >>> So you only want it for the prefer-busy-pollingc case, makes sense. I was >>> a bit confused by the difference between defer200 and suspend200, >>> but now I see that defer200 does not enable busypoll. >>> >>> I'm assuming that if you enable busypool in defer200 case, the numbers >>> should be similar to suspend200 (ignoring potentially affecting >>> non-busypolling queues due to higher gro_flush_timeout). >> >> defer200 + napi busy poll is essentially what we labelled "busy" and it does >> not perform as well, since it still suffers interference between application >> and softirq processing. > > With all your patches applied? Why? Userspace not keeping up? Note our "busy" case does not utilize our patches. As illustrated by our performance numbers, its performance is better than the base case, but at the cost of higher cpu utilization and it's still not as good as suspend20. 
Explanation (conjecture): It boils down to having to set a particular static value for gro-flush-timeout that is then always active. If busy-poll + application processing takes longer than this timeout, the next softirq runs while the application is still active, which causes interference. Once a softirq runs, the irq-loop (Loop 2) takes control. When the application thread comes back to epoll_wait, it already finds data, thus ep_poll does not run napi_busy_poll at all, thus the irq-loop stays in control. This continues until by chance the application finds no readily available data when calling epoll_wait and ep_poll runs another napi_busy_poll. Then the system switches back to busy-polling mode. So essentially the system non-deterministically alternates between busy-polling and irq deferral. irq deferral determines the high-order tail latencies, but there is still enough interference to make a difference. It's not as bad as in the base case, but not as good as properly controlled irq suspension. >>>>> Maybe expand more on what code paths are we trying to improve? Existing >>>>> busy polling code is not super readable, so would be nice to simplify >>>>> it a bit in the process (if possible) instead of adding one more tunable. >>>> >>>> There are essentially three possible loops for network processing: >>>> >>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality >>>> >>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme >>>> with the shortcomings described above >>>> >>>> 3) epoll -> busy-poll -> napi poll >>>> >>>> If a system is configured for 1), not much can be done, as it is difficult >>>> to interject anything into this loop without adding state and side effects. >>>> This is what we tried for the paper, but it ended up being a hack. >>>> >>>> If however the system is configured for irq deferral, Loops 2) and 3) >>>> "wrestle" with each other for control. Injecting the larger >>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour >>>> of Loop 3) and creates the nice pattern describe above. >>> >>> And you hit (2) when the epoll goes to sleep and/or when the userspace >>> isn't fast enough to keep up with the timer, presumably? I wonder >>> if need to use this opportunity and do proper API as Joe hints in the >>> cover letter. Something over netlink to say "I'm gonna busy-poll on >>> this queue / napi_id and with this timeout". And then we can essentially make >>> gro_flush_timeout per queue (and avoid >>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels >>> too hacky already :-( >> >> If someone would implement the necessary changes to make these parameters >> per-napi, this would improve things further, but note that the current >> proposal gives strong performance across a range of workloads, which is >> otherwise difficult to impossible to achieve. > > Let's see what other people have to say. But we tried to do a similar > setup at Google recently and getting all these parameters right > was not trivial. Joe's recent patch series to push some of these into > epoll context are a step in the right direction. It would be nice to > have more explicit interface to express busy poling preference for > the users vs chasing a bunch of global tunables and fighting against softirq > wakups. One of the goals of this patch set is to reduce parameter tuning and make the parameter setting independent of workload dynamics, so it should make things easier. 
This is of course notwithstanding that per-napi settings would be even better. If you are able to share more details of your previous experiments (here or off-list), I would be very interested. >> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of >> an individual queue or application to make sure that IRQ suspension is >> enabled/disabled right away when the state of the system changes from busy >> to idle and back. > > Can we not handle everything in napi_busy_loop? If we can mark some napi > contexts as "explicitly polled by userspace with a larger defer timeout", > we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > which is more like "this particular napi_poll call is user busy polling". Then either the application needs to be polling all the time (wasting cpu cycles) or latencies will be determined by the timeout. Only when switching back and forth between polling and interrupts is it possible to get low latencies across a large spectrum of offered loads without burning cpu cycles at 100%. >>>> [snip] >>>> >>>>>> - suspendX: >>>>>> - set defer_hard_irqs to 100 >>>>>> - set gro_flush_timeout to X,000 >>>>>> - set irq_suspend_timeout to 20,000,000 >>>>>> - enable busy poll via the existing ioctl (busy_poll_usecs = 0, >>>>>> busy_poll_budget = 64, prefer_busy_poll = true) >>>>> >>>>> What's the intention of `busy_poll_usecs = 0` here? Presumably we fallback >>>>> to busy_poll sysctl value? >>>> >>>> Before this patch set, ep_poll only calls napi_busy_poll, if busy_poll >>>> (sysctl) or busy_poll_usecs is nonzero. However, this might lead to >>>> busy-polling even when the application does not actually need or want it. >>>> Only one iteration through the busy loop is needed to make the new scheme >>>> work. Additional napi busy polling over and above is optional. >>> >>> Ack, thanks, was trying to understand why not stay with >>> busy_poll_usecs=64 for consistency. But I guess you were just >>> trying to show that patch 4/5 works. >> >> Right, and we would potentially be wasting CPU cycles by adding more >> busy-looping. > > Or potentially improving the latency more if you happen to get more packets > during busy_poll_usecs duration? I'd imagine some applications might > prefer to 100% busy poll without ever going to sleep (that would probably > require getting rid of napi_id tracking in epoll, but that's a different story). Yes, one could do full application-to-napi busy polling. The performance would be slightly better than irq suspension, but it would be quite wasteful during low load. One premise for our work is that saving cycles is a meaningful objective. Thanks, Martin
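For contrast, the "poll all the time" alternative discussed above is, from userspace, just a nonblocking epoll_wait() spin. A sketch, assuming <sys/epoll.h> and an application-side process() helper: it gives the lowest latency but pins a core at 100% regardless of load, which is the cost the suspend scheme avoids when traffic is low.

static void spin_poll(int epfd)			/* illustrative */
{
	struct epoll_event events[64];

	for (;;) {
		/* timeout 0: never block, never hand the NAPI back to IRQs */
		int n = epoll_wait(epfd, events, 64, 0);
		if (n > 0)
			process(events, n);	/* illustrative application work */
		/* no sleep: the core spins between napi polling and the application */
	}
}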
On 08/12, Martin Karsten wrote: > On 2024-08-12 21:54, Stanislav Fomichev wrote: > > On 08/12, Martin Karsten wrote: > > > On 2024-08-12 19:03, Stanislav Fomichev wrote: > > > > On 08/12, Martin Karsten wrote: > > > > > On 2024-08-12 16:19, Stanislav Fomichev wrote: > > > > > > On 08/12, Joe Damato wrote: > > > > > > > Greetings: > > > > > > > > > > > > > > Martin Karsten (CC'd) and I have been collaborating on some ideas about > > > > > > > ways of reducing tail latency when using epoll-based busy poll and we'd > > > > > > > love to get feedback from the list on the code in this series. This is > > > > > > > the idea I mentioned at netdev conf, for those who were there. Barring > > > > > > > any major issues, we hope to submit this officially shortly after RFC. > > > > > > > > > > > > > > The basic idea for suspending IRQs in this manner was described in an > > > > > > > earlier paper presented at Sigmetrics 2024 [1]. > > > > > > > > > > > > Let me explicitly call out the paper. Very nice analysis! > > > > > > > > > > Thank you! > > > > > > > > > > [snip] > > > > > > > > > > > > Here's how it is intended to work: > > > > > > > - An administrator sets the existing sysfs parameters for > > > > > > > defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > > > > > > > > > > > > > > - An administrator sets the new sysfs parameter irq_suspend_timeout > > > > > > > to a larger value than gro-timeout to enable IRQ suspension. > > > > > > > > > > > > Can you expand more on what's the problem with the existing gro_flush_timeout? > > > > > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the > > > > > > perfer_busy_poll case(why?)? Because looking at the first two patches, > > > > > > you essentially replace all usages of gro_flush_timeout with a new variable > > > > > > and I don't see how it helps. > > > > > > > > > > gro-flush-timeout (in combination with defer-hard-irqs) is the default irq > > > > > deferral mechanism and as such, always active when configured. Its static > > > > > periodic softirq processing leads to a situation where: > > > > > > > > > > - A long gro-flush-timeout causes high latencies when load is sufficiently > > > > > below capacity, or > > > > > > > > > > - a short gro-flush-timeout causes overhead when softirq execution > > > > > asynchronously competes with application processing at high load. > > > > > > > > > > The shortcomings of this are documented (to some extent) by our experiments. > > > > > See defer20 working well at low load, but having problems at high load, > > > > > while defer200 having higher latency at low load. > > > > > > > > > > irq-suspend-timeout is only active when an application uses > > > > > prefer-busy-polling and in that case, produces a nice alternating pattern of > > > > > application processing and networking processing (similar to what we > > > > > describe in the paper). This then works well with both low and high load. > > > > > > > > So you only want it for the prefer-busy-pollingc case, makes sense. I was > > > > a bit confused by the difference between defer200 and suspend200, > > > > but now I see that defer200 does not enable busypoll. > > > > > > > > I'm assuming that if you enable busypool in defer200 case, the numbers > > > > should be similar to suspend200 (ignoring potentially affecting > > > > non-busypolling queues due to higher gro_flush_timeout). 
> > > > > > defer200 + napi busy poll is essentially what we labelled "busy" and it does > > > not perform as well, since it still suffers interference between application > > > and softirq processing. > > > > With all your patches applied? Why? Userspace not keeping up? > > Note our "busy" case does not utilize our patches. Great, thanks for confirming, that makes sense! > As illustrated by our performance numbers, its performance is better than > the base case, but at the cost of higher cpu utilization and it's still not > as good as suspend20. > > Explanation (conjecture): > > It boils down to having to set a particular static value for > gro-flush-timeout that is then always active. > > If busy-poll + application processing takes longer than this timeout, the > next softirq runs while the application is still active, which causes > interference. > > Once a softirq runs, the irq-loop (Loop 2) takes control. When the > application thread comes back to epoll_wait, it already finds data, thus > ep_poll does not run napi_busy_poll at all, thus the irq-loop stays in > control. > > This continues until by chance the application finds no readily available > data when calling epoll_wait and ep_poll runs another napi_busy_poll. Then > the system switches back to busy-polling mode. > > So essentially the system non-deterministically alternates between > busy-polling and irq deferral. irq deferral determines the high-order tail > latencies, but there is still enough interference to make a difference. It's > not as bad as in the base case, but not as good as properly controlled irq > suspension. > > > > > > > Maybe expand more on what code paths are we trying to improve? Existing > > > > > > busy polling code is not super readable, so would be nice to simplify > > > > > > it a bit in the process (if possible) instead of adding one more tunable. > > > > > > > > > > There are essentially three possible loops for network processing: > > > > > > > > > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality > > > > > > > > > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > > > > > with the shortcomings described above > > > > > > > > > > 3) epoll -> busy-poll -> napi poll > > > > > > > > > > If a system is configured for 1), not much can be done, as it is difficult > > > > > to interject anything into this loop without adding state and side effects. > > > > > This is what we tried for the paper, but it ended up being a hack. > > > > > > > > > > If however the system is configured for irq deferral, Loops 2) and 3) > > > > > "wrestle" with each other for control. Injecting the larger > > > > > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour > > > > > of Loop 3) and creates the nice pattern describe above. > > > > > > > > And you hit (2) when the epoll goes to sleep and/or when the userspace > > > > isn't fast enough to keep up with the timer, presumably? I wonder > > > > if need to use this opportunity and do proper API as Joe hints in the > > > > cover letter. Something over netlink to say "I'm gonna busy-poll on > > > > this queue / napi_id and with this timeout". And then we can essentially make > > > > gro_flush_timeout per queue (and avoid > > > > napi_resume_irqs/napi_suspend_irqs). 
Existing gro_flush_timeout feels > > > > too hacky already :-( > > > > > > If someone would implement the necessary changes to make these parameters > > > per-napi, this would improve things further, but note that the current > > > proposal gives strong performance across a range of workloads, which is > > > otherwise difficult to impossible to achieve. > > > > Let's see what other people have to say. But we tried to do a similar > > setup at Google recently and getting all these parameters right > > was not trivial. Joe's recent patch series to push some of these into > > epoll context are a step in the right direction. It would be nice to > > have more explicit interface to express busy poling preference for > > the users vs chasing a bunch of global tunables and fighting against softirq > > wakups. > > One of the goals of this patch set is to reduce parameter tuning and make > the parameter setting independent of workload dynamics, so it should make > things easier. This is of course notwithstanding that per-napi settings > would be even better. > > If you are able to share more details of your previous experiments (here or > off-list), I would be very interested. We went through a similar exercise of trying to get the tail latencies down. Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except we went with a hard-coded napi_id argument instead of tracking) and trying to get a workable set of budget/timeout/gro_flush. We were fine with burning all cpu capacity we had and no sleep at all, so we ended up having a bunch of special cases in epoll loop to avoid the sleep. But we were trying to make a different model work (the one you mention in the paper as well) where the userspace busy-pollers are just running napi_poll on one cpu and the actual work is consumed by the userspace on a different cpu. (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll and another epoll fd with actual sockets for signaling). This mode has a different set of challenges with socket lock, socket rx queue and the backlog processing :-( > > > Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > > > an individual queue or application to make sure that IRQ suspension is > > > enabled/disabled right away when the state of the system changes from busy > > > to idle and back. > > > > Can we not handle everything in napi_busy_loop? If we can mark some napi > > contexts as "explicitly polled by userspace with a larger defer timeout", > > we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > > which is more like "this particular napi_poll call is user busy polling". > > Then either the application needs to be polling all the time (wasting cpu > cycles) or latencies will be determined by the timeout. > > Only when switching back and forth between polling and interrupts is it > possible to get low latencies across a large spectrum of offered loads > without burning cpu cycles at 100%. Ah, I see what you're saying, yes, you're right. In this case ignore my comment about ep_suspend_napi_irqs/napi_resume_irqs. Let's see how other people feel about per-dev irq_suspend_timeout. Properly disabling napi during busy polling is super useful, but it would still be nice to plumb irq_suspend_timeout via epoll context or have it set on a per-napi basis imho.
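Purely to illustrate the last suggestion (plumbing irq_suspend_timeout via the epoll context rather than per device): one hypothetical shape would be an extended epoll_params. No such field exists in this series or upstream; it is shown only to make the idea concrete.

#include <linux/types.h>

/* hypothetical extension -- NOT part of this series or the current uapi */
struct epoll_params_hypothetical {
	__u32 busy_poll_usecs;
	__u16 busy_poll_budget;
	__u8  prefer_busy_poll;
	__u8  __pad;
	__u64 irq_suspend_timeout;	/* ns; 0 = fall back to the per-device sysfs value */
};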
On 2024-08-13 00:07, Stanislav Fomichev wrote: > On 08/12, Martin Karsten wrote: >> On 2024-08-12 21:54, Stanislav Fomichev wrote: >>> On 08/12, Martin Karsten wrote: >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: >>>>> On 08/12, Martin Karsten wrote: >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>>>>>> On 08/12, Joe Damato wrote: >>>>>>>> Greetings: [snip] >>>>>>> Maybe expand more on what code paths are we trying to improve? Existing >>>>>>> busy polling code is not super readable, so would be nice to simplify >>>>>>> it a bit in the process (if possible) instead of adding one more tunable. >>>>>> >>>>>> There are essentially three possible loops for network processing: >>>>>> >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality >>>>>> >>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme >>>>>> with the shortcomings described above >>>>>> >>>>>> 3) epoll -> busy-poll -> napi poll >>>>>> >>>>>> If a system is configured for 1), not much can be done, as it is difficult >>>>>> to interject anything into this loop without adding state and side effects. >>>>>> This is what we tried for the paper, but it ended up being a hack. >>>>>> >>>>>> If however the system is configured for irq deferral, Loops 2) and 3) >>>>>> "wrestle" with each other for control. Injecting the larger >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour >>>>>> of Loop 3) and creates the nice pattern describe above. >>>>> >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace >>>>> isn't fast enough to keep up with the timer, presumably? I wonder >>>>> if need to use this opportunity and do proper API as Joe hints in the >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on >>>>> this queue / napi_id and with this timeout". And then we can essentially make >>>>> gro_flush_timeout per queue (and avoid >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels >>>>> too hacky already :-( >>>> >>>> If someone would implement the necessary changes to make these parameters >>>> per-napi, this would improve things further, but note that the current >>>> proposal gives strong performance across a range of workloads, which is >>>> otherwise difficult to impossible to achieve. >>> >>> Let's see what other people have to say. But we tried to do a similar >>> setup at Google recently and getting all these parameters right >>> was not trivial. Joe's recent patch series to push some of these into >>> epoll context are a step in the right direction. It would be nice to >>> have more explicit interface to express busy poling preference for >>> the users vs chasing a bunch of global tunables and fighting against softirq >>> wakups. >> >> One of the goals of this patch set is to reduce parameter tuning and make >> the parameter setting independent of workload dynamics, so it should make >> things easier. This is of course notwithstanding that per-napi settings >> would be even better. >> >> If you are able to share more details of your previous experiments (here or >> off-list), I would be very interested. > > We went through a similar exercise of trying to get the tail latencies down. > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except > we went with a hard-coded napi_id argument instead of tracking) and trying to > get a workable set of budget/timeout/gro_flush. 
We were fine with burning all > cpu capacity we had and no sleep at all, so we ended up having a bunch > of special cases in epoll loop to avoid the sleep. > > But we were trying to make a different model work (the one you mention in the > paper as well) where the userspace busy-pollers are just running napi_poll > on one cpu and the actual work is consumed by the userspace on a different cpu. > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll > and another epoll fd with actual sockets for signaling). > > This mode has a different set of challenges with socket lock, socket rx > queue and the backlog processing :-( I agree. That model has challenges and is extremely difficult to tune right. >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of >>>> an individual queue or application to make sure that IRQ suspension is >>>> enabled/disabled right away when the state of the system changes from busy >>>> to idle and back. >>> >>> Can we not handle everything in napi_busy_loop? If we can mark some napi >>> contexts as "explicitly polled by userspace with a larger defer timeout", >>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL >>> which is more like "this particular napi_poll call is user busy polling". >> >> Then either the application needs to be polling all the time (wasting cpu >> cycles) or latencies will be determined by the timeout. >> >> Only when switching back and forth between polling and interrupts is it >> possible to get low latencies across a large spectrum of offered loads >> without burning cpu cycles at 100%. > > Ah, I see what you're saying, yes, you're right. In this case ignore my comment > about ep_suspend_napi_irqs/napi_resume_irqs. Thanks for probing and double-checking everything! Feedback is important for us to properly document our proposal. > Let's see how other people feel about per-dev irq_suspend_timeout. Properly > disabling napi during busy polling is super useful, but it would still > be nice to plumb irq_suspend_timeout via epoll context or have it set on > a per-napi basis imho. Fingers crossed. I hope this patch will be accepted, because it has practical performance and efficiency benefits, and that this will further increase the motivation to re-design the entire irq defer(/suspend) infrastructure for per-napi settings. Thanks, Martin
On Mon, 12 Aug 2024 17:46:42 -0400 Martin Karsten wrote: > >> Here's how it is intended to work: > >> - An administrator sets the existing sysfs parameters for > >> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. > >> > >> - An administrator sets the new sysfs parameter irq_suspend_timeout > >> to a larger value than gro-timeout to enable IRQ suspension. > > > > Can you expand more on what's the problem with the existing gro_flush_timeout? > > Is it defer_hard_irqs_count? Or you want a separate timeout only for the > > perfer_busy_poll case(why?)? Because looking at the first two patches, > > you essentially replace all usages of gro_flush_timeout with a new variable > > and I don't see how it helps. > > gro-flush-timeout (in combination with defer-hard-irqs) is the default > irq deferral mechanism and as such, always active when configured. Its > static periodic softirq processing leads to a situation where: > > - A long gro-flush-timeout causes high latencies when load is > sufficiently below capacity, or > > - a short gro-flush-timeout causes overhead when softirq execution > asynchronously competes with application processing at high load. > > The shortcomings of this are documented (to some extent) by our > experiments. See defer20 working well at low load, but having problems > at high load, while defer200 having higher latency at low load. > > irq-suspend-timeout is only active when an application uses > prefer-busy-polling and in that case, produces a nice alternating > pattern of application processing and networking processing (similar to > what we describe in the paper). This then works well with both low and > high load. What about NIC interrupt coalescing. defer_hard_irqs_count was supposed to be used with NICs which either don't have IRQ coalescing or have a broken implementation. The timeout of 200usec should be perfectly within range of what NICs can support. If the NIC IRQ coalescing works, instead of adding a new timeout value we could add a new deferral control (replacing defer_hard_irqs_count) which would always kick in after seeing prefer_busy_poll() but also not kick in if the busy poll harvested 0 packets. > > Maybe expand more on what code paths are we trying to improve? Existing > > busy polling code is not super readable, so would be nice to simplify > > it a bit in the process (if possible) instead of adding one more tunable. > > There are essentially three possible loops for network processing: > > 1) hardirq -> softirq -> napi poll; this is the baseline functionality > > 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > with the shortcomings described above > > 3) epoll -> busy-poll -> napi poll > > If a system is configured for 1), not much can be done, as it is > difficult to interject anything into this loop without adding state and > side effects. This is what we tried for the paper, but it ended up being > a hack. > > If however the system is configured for irq deferral, Loops 2) and 3) > "wrestle" with each other for control. Injecting the larger > irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in > favour of Loop 3) and creates the nice pattern describe above.
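For context on the NIC-side alternative Jakub mentions, classic interrupt coalescing is set per device with ethtool (roughly `ethtool -C <dev> rx-usecs 200`). A minimal sketch using the underlying ETHTOOL_SCOALESCE ioctl, with the device name and value as placeholders and error handling omitted:

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

static int set_rx_usecs(const char *dev, unsigned int usecs)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	ioctl(fd, SIOCETHTOOL, &ifr);		/* read current coalescing settings */
	ec.cmd = ETHTOOL_SCOALESCE;
	ec.rx_coalesce_usecs = usecs;		/* e.g. 200 to delay the RX IRQ ~200 us */
	int ret = ioctl(fd, SIOCETHTOOL, &ifr);	/* write them back */

	close(fd);
	return ret;
}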
On 2024-08-13 20:10, Jakub Kicinski wrote: > On Mon, 12 Aug 2024 17:46:42 -0400 Martin Karsten wrote: >>>> Here's how it is intended to work: >>>> - An administrator sets the existing sysfs parameters for >>>> defer_hard_irqs and gro_flush_timeout to enable IRQ deferral. >>>> >>>> - An administrator sets the new sysfs parameter irq_suspend_timeout >>>> to a larger value than gro-timeout to enable IRQ suspension. >>> >>> Can you expand more on what's the problem with the existing gro_flush_timeout? >>> Is it defer_hard_irqs_count? Or you want a separate timeout only for the >>> perfer_busy_poll case(why?)? Because looking at the first two patches, >>> you essentially replace all usages of gro_flush_timeout with a new variable >>> and I don't see how it helps. >> >> gro-flush-timeout (in combination with defer-hard-irqs) is the default >> irq deferral mechanism and as such, always active when configured. Its >> static periodic softirq processing leads to a situation where: >> >> - A long gro-flush-timeout causes high latencies when load is >> sufficiently below capacity, or >> >> - a short gro-flush-timeout causes overhead when softirq execution >> asynchronously competes with application processing at high load. >> >> The shortcomings of this are documented (to some extent) by our >> experiments. See defer20 working well at low load, but having problems >> at high load, while defer200 having higher latency at low load. >> >> irq-suspend-timeout is only active when an application uses >> prefer-busy-polling and in that case, produces a nice alternating >> pattern of application processing and networking processing (similar to >> what we describe in the paper). This then works well with both low and >> high load. > > What about NIC interrupt coalescing. defer_hard_irqs_count was supposed > to be used with NICs which either don't have IRQ coalescing or have a > broken implementation. The timeout of 200usec should be perfectly within > range of what NICs can support. > > If the NIC IRQ coalescing works, instead of adding a new timeout value > we could add a new deferral control (replacing defer_hard_irqs_count) > which would always kick in after seeing prefer_busy_poll() but also > not kick in if the busy poll harvested 0 packets. Maybe I am missing something, but I believe this would have the same problem that we describe for gro-timeout + defer-irq. When busy poll does not harvest packets and the application thread is idle and goes to sleep, it would then take up to 200 us to get the next interrupt. This considerably increases tail latencies under low load. In order get low latencies under low load, the NIC timeout would have to be something like 20 us, but under high load the application thread will be busy for longer than 20 us and the interrupt (and softirq) will come too early and cause interference. The fundamental problem is that one fixed timer cannot handle dynamic workloads, regardless of whether the timer is implemented in software or the NIC. However, the current software implementation of the timer makes it easy to add our mechanism that effectively switches between a short and a long timeout. I assume it would be more difficult/overhead to update the NIC timer all the time. In other words, the complexity is always the same: A very long timeout is needed to suspend irqs during periods of successful busy polling and application processing. A short timeout is needed to receive the next packet(s) with low latency during idle periods. 
It is tempting to think of the second timeout as 0 and in fact re-enable interrupts right away. We have tried it, but it leads to a lot of interrupts and corresponding inefficiencies, since a system below capacity frequently switches between busy and idle. Using a small timeout (20 us) for modest deferral and batching when idle is a lot more efficient. Thanks, Martin
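To restate the mechanism in code form: the series effectively keeps one deferral mechanism but alternates its period with the application's busy/idle state, rather than relying on a single constant. The snippet below is a standalone userspace model of that decision, not the kernel implementation; the two timeout values are assumptions chosen to match the 20 us / "much larger" split discussed in this thread.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values: a short timer bounds latency and batches work when idle,
 * a long timer only acts as a safety net while IRQs are suspended because
 * an application is actively busy polling.
 */
#define GRO_FLUSH_TIMEOUT_NS     20000ULL       /* 20 us */
#define IRQ_SUSPEND_TIMEOUT_NS   20000000ULL    /* 20 ms */

/* Model of the alternation described above: one mechanism, two periods,
 * selected by the busy/idle state instead of a single fixed constant.
 */
static uint64_t next_timer_ns(bool irqs_suspended)
{
        return irqs_suspended ? IRQ_SUSPEND_TIMEOUT_NS : GRO_FLUSH_TIMEOUT_NS;
}

int main(void)
{
        printf("busy polling (suspended): %llu ns\n",
               (unsigned long long)next_timer_ns(true));
        printf("idle (normal deferral):   %llu ns\n",
               (unsigned long long)next_timer_ns(false));
        return 0;
}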
Martin Karsten wrote: > On 2024-08-13 00:07, Stanislav Fomichev wrote: > > On 08/12, Martin Karsten wrote: > >> On 2024-08-12 21:54, Stanislav Fomichev wrote: > >>> On 08/12, Martin Karsten wrote: > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > >>>>> On 08/12, Martin Karsten wrote: > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > >>>>>>> On 08/12, Joe Damato wrote: > >>>>>>>> Greetings: > > [snip] > > >>>>>>> Maybe expand more on what code paths are we trying to improve? Existing > >>>>>>> busy polling code is not super readable, so would be nice to simplify > >>>>>>> it a bit in the process (if possible) instead of adding one more tunable. > >>>>>> > >>>>>> There are essentially three possible loops for network processing: > >>>>>> > >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality > >>>>>> > >>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > >>>>>> with the shortcomings described above > >>>>>> > >>>>>> 3) epoll -> busy-poll -> napi poll > >>>>>> > >>>>>> If a system is configured for 1), not much can be done, as it is difficult > >>>>>> to interject anything into this loop without adding state and side effects. > >>>>>> This is what we tried for the paper, but it ended up being a hack. > >>>>>> > >>>>>> If however the system is configured for irq deferral, Loops 2) and 3) > >>>>>> "wrestle" with each other for control. Injecting the larger > >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour > >>>>>> of Loop 3) and creates the nice pattern describe above. > >>>>> > >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace > >>>>> isn't fast enough to keep up with the timer, presumably? I wonder > >>>>> if need to use this opportunity and do proper API as Joe hints in the > >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on > >>>>> this queue / napi_id and with this timeout". And then we can essentially make > >>>>> gro_flush_timeout per queue (and avoid > >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels > >>>>> too hacky already :-( > >>>> > >>>> If someone would implement the necessary changes to make these parameters > >>>> per-napi, this would improve things further, but note that the current > >>>> proposal gives strong performance across a range of workloads, which is > >>>> otherwise difficult to impossible to achieve. > >>> > >>> Let's see what other people have to say. But we tried to do a similar > >>> setup at Google recently and getting all these parameters right > >>> was not trivial. Joe's recent patch series to push some of these into > >>> epoll context are a step in the right direction. It would be nice to > >>> have more explicit interface to express busy poling preference for > >>> the users vs chasing a bunch of global tunables and fighting against softirq > >>> wakups. > >> > >> One of the goals of this patch set is to reduce parameter tuning and make > >> the parameter setting independent of workload dynamics, so it should make > >> things easier. This is of course notwithstanding that per-napi settings > >> would be even better. I don't follow how adding another tunable reduces parameter tuning. > >> > >> If you are able to share more details of your previous experiments (here or > >> off-list), I would be very interested. > > > > We went through a similar exercise of trying to get the tail latencies down. 
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except > > we went with a hard-coded napi_id argument instead of tracking) and trying to > > get a workable set of budget/timeout/gro_flush. We were fine with burning all > > cpu capacity we had and no sleep at all, so we ended up having a bunch > > of special cases in epoll loop to avoid the sleep. > > > > But we were trying to make a different model work (the one you mention in the > > paper as well) where the userspace busy-pollers are just running napi_poll > > on one cpu and the actual work is consumed by the userspace on a different cpu. > > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll > > and another epoll fd with actual sockets for signaling). > > > > This mode has a different set of challenges with socket lock, socket rx > > queue and the backlog processing :-( > > I agree. That model has challenges and is extremely difficult to tune right. > > >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > >>>> an individual queue or application to make sure that IRQ suspension is > >>>> enabled/disabled right away when the state of the system changes from busy > >>>> to idle and back. > >>> > >>> Can we not handle everything in napi_busy_loop? If we can mark some napi > >>> contexts as "explicitly polled by userspace with a larger defer timeout", > >>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > >>> which is more like "this particular napi_poll call is user busy polling". > >> > >> Then either the application needs to be polling all the time (wasting cpu > >> cycles) or latencies will be determined by the timeout. > >> > >> Only when switching back and forth between polling and interrupts is it > >> possible to get low latencies across a large spectrum of offered loads > >> without burning cpu cycles at 100%. > > > > Ah, I see what you're saying, yes, you're right. In this case ignore my comment > > about ep_suspend_napi_irqs/napi_resume_irqs. > > Thanks for probing and double-checking everything! Feedback is important > for us to properly document our proposal. > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly > > disabling napi during busy polling is super useful, but it would still > > be nice to plumb irq_suspend_timeout via epoll context or have it set on > > a per-napi basis imho. > > Fingers crossed. I hope this patch will be accepted, because it has > practical performance and efficiency benefits, and that this will > further increase the motivation to re-design the entire irq > defer(/suspend) infrastructure for per-napi settings. Overall, the idea of keeping interrupts disabled during event processing is very interesting. Hopefully the interface can be made more intuitive. Or documented more easily. I had to read the kernel patches to fully (perhaps) grasp it. Another +1 on the referenced paper. Pointing out a specific difference in behavior that is unrelated to the protection domain, rather than a straightforward kernel vs user argument. The paper also had some explanation that may be clearer for a commit message than the current cover letter: "user-level network stacks put the application in charge of the entire network stack processing (cf. Section 2). 
Interrupts are disabled and the application coordinates execution by alternating between processing existing requests and polling the RX queues for new data" " [This series extends this behavior to kernel busy polling, while falling back onto interrupt processing to limit CPU overhead.] "Instead of re-enabling the respective interrupt(s) as soon as epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay masked until a subsequent epoll_wait() call comes up empty, i.e., no events of interest are found and the application thread is about to be blocked." "A fallback technical approach would use a kernel timeout set on the return path from epoll_wait(). If necessary, the timeout re-enables interrupts regardless of the application’s (mis)behaviour." [Where misbehavior is not calling epoll_wait again] "The resulting execution model mimics the execution model of typical user-level network stacks and does not add any requirements compared to user-level networking. In fact, it is slightly better, because it can resort to blocking and interrupt delivery, instead of having to continuously busyloop during idle times." This last part shows a preference on your part for a trade-off: you want low latency, but also low cpu utilization if possible. This also came up in this thread. Please state that design decision explicitly. There are plenty of workloads where burning a core is acceptable (especially as core count continues increasing), not "slightly worse". Kernel polling with full busy polling is also already possible, by choosing a very high napi_defer_hard_irqs and gro_flush_timeout. So high in fact, that these tunables need not be tuned carefully. So what this series adds is not interrupt suppression during event processing per se, but doing so in a hybrid mode balancing latency and cpu load.
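The quoted execution model needs nothing unusual from the application side: a plain epoll loop already produces the alternating pattern once prefer-busy-poll is enabled on the epoll context via the epoll ioctl. A rough sketch follows, assuming epfd is such an epoll instance whose registered sockets map to a single NAPI ID; under the proposed mechanism, IRQs stay suspended as long as the calls keep returning events and normal deferral resumes when a call comes up empty and the thread blocks.

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 8    /* batch size also bounds the worst-case busy period */

static void handle_event(struct epoll_event *ev)
{
        (void)ev;       /* application-specific request processing */
}

/* Alternate between application processing and network processing.
 * With prefer-busy-poll set on epfd, epoll_wait busy polls the NAPI
 * context before blocking; under this series, device IRQs stay
 * suspended while calls keep returning events, and normal deferral
 * resumes once a call finds nothing and the thread goes to sleep.
 */
void event_loop(int epfd)
{
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
                int n = epoll_wait(epfd, events, MAX_EVENTS, -1);

                if (n < 0) {
                        perror("epoll_wait");
                        return;
                }
                for (int i = 0; i < n; i++)
                        handle_event(&events[i]);
        }
}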
On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote: > Martin Karsten wrote: > > On 2024-08-13 00:07, Stanislav Fomichev wrote: > > > On 08/12, Martin Karsten wrote: > > >> On 2024-08-12 21:54, Stanislav Fomichev wrote: > > >>> On 08/12, Martin Karsten wrote: > > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > > >>>>> On 08/12, Martin Karsten wrote: > > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > > >>>>>>> On 08/12, Joe Damato wrote: [...] > > >> > > >> One of the goals of this patch set is to reduce parameter tuning and make > > >> the parameter setting independent of workload dynamics, so it should make > > >> things easier. This is of course notwithstanding that per-napi settings > > >> would be even better. > > I don't follow how adding another tunable reduces parameter tuning. Thanks for taking a look and providing valuable feedback, Willem. An early draft of the cover letter included some paragraphs which were removed for the sake of brevity that we can add back in, which addresses your comment above: The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout) is useful, but picking the correct values for these settings is difficult and the ideal values change as the type of traffic and workload changes. For example: picking a large timeout value is good for batching packet processing, but induces latency. Picking a small timeout value will interrupt user app processing and reduce efficiency. The value chosen would be different depending on whether the system is under high load (large timeout) or when less busy (small timeout). As such, adding the new tunable makes it much easier to use the existing ones and also produces better performance as shown in the results we presented. Please let me know if you have any questions; I know that the change we are introducing is very subtle and I am happy to expand the cover letter if it'd be helpful for you. My concern was that the cover letter was too long already, but a big takeaway for me thus far has been that we should expand the cover letter. [...] > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly > > > disabling napi during busy polling is super useful, but it would still > > > be nice to plumb irq_suspend_timeout via epoll context or have it set on > > > a per-napi basis imho. > > > > Fingers crossed. I hope this patch will be accepted, because it has > > practical performance and efficiency benefits, and that this will > > further increase the motivation to re-design the entire irq > > defer(/suspend) infrastructure for per-napi settings. > > Overall, the idea of keeping interrupts disabled during event > processing is very interesting. Thanks; I'm happy to hear we are aligned on this. > Hopefully the interface can be made more intuitive. Or documented more > easily. I had to read the kernel patches to fully (perhaps) grasp it. > > Another +1 on the referenced paper. Pointing out a specific difference > in behavior that is unrelated to the protection domain, rather than a > straightforward kernel vs user argument. The paper also had some > explanation that may be clearer for a commit message than the current > cover letter: > > "user-level network stacks put the application in charge of the entire > network stack processing (cf. Section 2). 
Interrupts are disabled and > the application coordinates execution by alternating between > processing existing requests and polling the RX queues for new data" > " [This series extends this behavior to kernel busy polling, while > falling back onto interrupt processing to limit CPU overhead.] > > "Instead of re-enabling the respective interrupt(s) as soon as > epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay > masked until a subsequent epoll_wait() call comes up empty, i.e., no > events of interest are found and the application thread is about to be > blocked." > > "A fallback technical approach would use a kernel timeout set on the > return path from epoll_wait(). If necessary, the timeout re-enables > interrupts regardless of the application´s (mis)behaviour." > [Where misbehavior is not calling epoll_wait again] > > "The resulting execution model mimics the execution model of typical > user-level network stacks and does not add any requirements compared > to user-level networking. In fact, it is slightly better, because it > can resort to blocking and interrupt delivery, instead of having to > continuously busyloop during idle times." > > This last part shows a preference on your part to a trade-off: > you want low latency, but also low cpu utilization if possible. > This also came up in this thread. Please state that design decision > explicitly. Sure, we can include that in the list of cover letter updates we need to make. I could have called it out more clearly, but in the cover letter [1] I mentioned that latency improved for compared CPU usage (i.e. CPU efficiency improved): The overall takeaway from the results below is that the new mechanism (suspend20, see below) results in reduced 99th percentile latency and increased QPS in the MAX QPS case (compared to the other cases), and reduced latency in the lower QPS cases for comparable CPU usage to the base case (and less CPU than the busy case). > There are plenty of workloads where burning a core is acceptable > (especially as core count continues increasing), not "slightly worse". Respectfully, I don't think I'm on board with this argument. "Burning a core" has side effects even as core counts increase (power, cooling, etc) and it seems the counter argument is equally valid, as well: there are plenty of workloads where burning a core is undesirable. Using less CPU to get comparable performance is strictly better, even if a system can theoretically support the increased CPU/power/cooling load. Either way: this is not an either/or. Adding support for the code we've proposed will be very beneficial for an important set of workloads without taking anything anyway. - Joe [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
Joe Damato wrote: > On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote: > > Martin Karsten wrote: > > > On 2024-08-13 00:07, Stanislav Fomichev wrote: > > > > On 08/12, Martin Karsten wrote: > > > >> On 2024-08-12 21:54, Stanislav Fomichev wrote: > > > >>> On 08/12, Martin Karsten wrote: > > > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > > > >>>>> On 08/12, Martin Karsten wrote: > > > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > > > >>>>>>> On 08/12, Joe Damato wrote: > > [...] > > > > >> > > > >> One of the goals of this patch set is to reduce parameter tuning and make > > > >> the parameter setting independent of workload dynamics, so it should make > > > >> things easier. This is of course notwithstanding that per-napi settings > > > >> would be even better. > > > > I don't follow how adding another tunable reduces parameter tuning. > > Thanks for taking a look and providing valuable feedback, Willem. > > An early draft of the cover letter included some paragraphs which were removed > for the sake of brevity that we can add back in, which addresses your comment > above: > > The existing mechanism in the kernel (defer_hard_irqs and gro_flush_timeout) > is useful, but picking the correct values for these settings is difficult and > the ideal values change as the type of traffic and workload changes. > > For example: picking a large timeout value is good for batching packet > processing, but induces latency. Picking a small timeout value will interrupt > user app processing and reduce efficiency. The value chosen would be different > depending on whether the system is under high load (large timeout) or when less > busy (small timeout). > > As such, adding the new tunable makes it much easier to use the existing ones > and also produces better performance as shown in the results we presented. > > Please let me know if you have any questions; I know that the change we are > introducing is very subtle and I am happy to expand the cover letter if it'd be > helpful for you. > > My concern was that the cover letter was too long already, but a big takeaway > for me thus far has been that we should expand the cover letter. > > [...] > > > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly > > > > disabling napi during busy polling is super useful, but it would still > > > > be nice to plumb irq_suspend_timeout via epoll context or have it set on > > > > a per-napi basis imho. > > > > > > Fingers crossed. I hope this patch will be accepted, because it has > > > practical performance and efficiency benefits, and that this will > > > further increase the motivation to re-design the entire irq > > > defer(/suspend) infrastructure for per-napi settings. > > > > Overall, the idea of keeping interrupts disabled during event > > processing is very interesting. > > Thanks; I'm happy to hear we are aligned on this. > > > Hopefully the interface can be made more intuitive. Or documented more > > easily. I had to read the kernel patches to fully (perhaps) grasp it. > > > > Another +1 on the referenced paper. Pointing out a specific difference > > in behavior that is unrelated to the protection domain, rather than a > > straightforward kernel vs user argument. The paper also had some > > explanation that may be clearer for a commit message than the current > > cover letter: > > > > "user-level network stacks put the application in charge of the entire > > network stack processing (cf. Section 2). 
Interrupts are disabled and > > the application coordinates execution by alternating between > > processing existing requests and polling the RX queues for new data" > > " [This series extends this behavior to kernel busy polling, while > > falling back onto interrupt processing to limit CPU overhead.] > > > > "Instead of re-enabling the respective interrupt(s) as soon as > > epoll_wait() returns from its NAPI busy loop, the relevant IRQs stay > > masked until a subsequent epoll_wait() call comes up empty, i.e., no > > events of interest are found and the application thread is about to be > > blocked." > > > > "A fallback technical approach would use a kernel timeout set on the > > return path from epoll_wait(). If necessary, the timeout re-enables > > interrupts regardless of the application´s (mis)behaviour." > > [Where misbehavior is not calling epoll_wait again] > > > > "The resulting execution model mimics the execution model of typical > > user-level network stacks and does not add any requirements compared > > to user-level networking. In fact, it is slightly better, because it > > can resort to blocking and interrupt delivery, instead of having to > > continuously busyloop during idle times." > > > > This last part shows a preference on your part to a trade-off: > > you want low latency, but also low cpu utilization if possible. > > This also came up in this thread. Please state that design decision > > explicitly. > > Sure, we can include that in the list of cover letter updates we > need to make. I could have called it out more clearly, but in the cover > letter [1] I mentioned that latency improved for compared CPU usage (i.e. > CPU efficiency improved): > > The overall takeaway from the results below is that the new mechanism > (suspend20, see below) results in reduced 99th percentile latency and > increased QPS in the MAX QPS case (compared to the other cases), and > reduced latency in the lower QPS cases for comparable CPU usage to the > base case (and less CPU than the busy case). > > > There are plenty of workloads where burning a core is acceptable > > (especially as core count continues increasing), not "slightly worse". > > Respectfully, I don't think I'm on board with this argument. "Burning a core" > has side effects even as core counts increase (power, cooling, etc) and it > seems the counter argument is equally valid, as well: there are plenty of > workloads where burning a core is undesirable. Even more, likely. As this might be usable for standard efficiency focused production services. > Using less CPU to get comparable performance is strictly better, even if a > system can theoretically support the increased CPU/power/cooling load. If it is always a strict win yes. But falling back onto interrupts with standard moderation will not match busy polling in all cases. Different solutions for different workloads. No need to stack rank them. My request is just to be explicit which design point this chooses, and that the other design point (continuous busy polling) is already addressed in Linux kernel busypolling. > Either way: this is not an either/or. Adding support for the code we've > proposed will be very beneficial for an important set of workloads without > taking anything anyway. > > - Joe > > [1]: https://lore.kernel.org/netdev/20240812125717.413108-1-jdamato@fastly.com/
On Wed, Aug 14, 2024 at 11:08:45AM -0400, Willem de Bruijn wrote: > Joe Damato wrote: [...] > > On Tue, Aug 13, 2024 at 11:16:07PM -0400, Willem de Bruijn wrote: > > Using less CPU to get comparable performance is strictly better, even if a > > system can theoretically support the increased CPU/power/cooling load. > > If it is always a strict win yes. But falling back onto interrupts > with standard moderation will not match busy polling in all cases. > > Different solutions for different workloads. No need to stack rank > them. My request is just to be explicit which design point this > chooses, and that the other design point (continuous busy polling) is > already addressed in Linux kernel busypolling. Sure, sounds good; we can fix that in the cover letter. Thanks for taking a look.
On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > On 2024-08-13 00:07, Stanislav Fomichev wrote: > > On 08/12, Martin Karsten wrote: > >> On 2024-08-12 21:54, Stanislav Fomichev wrote: > >>> On 08/12, Martin Karsten wrote: > >>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > >>>>> On 08/12, Martin Karsten wrote: > >>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > >>>>>>> On 08/12, Joe Damato wrote: > >>>>>>>> Greetings: > > [snip] > > >>>>>>> Maybe expand more on what code paths are we trying to improve? Existing > >>>>>>> busy polling code is not super readable, so would be nice to simplify > >>>>>>> it a bit in the process (if possible) instead of adding one more tunable. > >>>>>> > >>>>>> There are essentially three possible loops for network processing: > >>>>>> > >>>>>> 1) hardirq -> softirq -> napi poll; this is the baseline functionality > >>>>>> > >>>>>> 2) timer -> softirq -> napi poll; this is deferred irq processing scheme > >>>>>> with the shortcomings described above > >>>>>> > >>>>>> 3) epoll -> busy-poll -> napi poll > >>>>>> > >>>>>> If a system is configured for 1), not much can be done, as it is difficult > >>>>>> to interject anything into this loop without adding state and side effects. > >>>>>> This is what we tried for the paper, but it ended up being a hack. > >>>>>> > >>>>>> If however the system is configured for irq deferral, Loops 2) and 3) > >>>>>> "wrestle" with each other for control. Injecting the larger > >>>>>> irq-suspend-timeout for 'timer' in Loop 2) essentially tilts this in favour > >>>>>> of Loop 3) and creates the nice pattern describe above. > >>>>> > >>>>> And you hit (2) when the epoll goes to sleep and/or when the userspace > >>>>> isn't fast enough to keep up with the timer, presumably? I wonder > >>>>> if need to use this opportunity and do proper API as Joe hints in the > >>>>> cover letter. Something over netlink to say "I'm gonna busy-poll on > >>>>> this queue / napi_id and with this timeout". And then we can essentially make > >>>>> gro_flush_timeout per queue (and avoid > >>>>> napi_resume_irqs/napi_suspend_irqs). Existing gro_flush_timeout feels > >>>>> too hacky already :-( > >>>> > >>>> If someone would implement the necessary changes to make these parameters > >>>> per-napi, this would improve things further, but note that the current > >>>> proposal gives strong performance across a range of workloads, which is > >>>> otherwise difficult to impossible to achieve. > >>> > >>> Let's see what other people have to say. But we tried to do a similar > >>> setup at Google recently and getting all these parameters right > >>> was not trivial. Joe's recent patch series to push some of these into > >>> epoll context are a step in the right direction. It would be nice to > >>> have more explicit interface to express busy poling preference for > >>> the users vs chasing a bunch of global tunables and fighting against softirq > >>> wakups. > >> > >> One of the goals of this patch set is to reduce parameter tuning and make > >> the parameter setting independent of workload dynamics, so it should make > >> things easier. This is of course notwithstanding that per-napi settings > >> would be even better. > >> > >> If you are able to share more details of your previous experiments (here or > >> off-list), I would be very interested. > > > > We went through a similar exercise of trying to get the tail latencies down. 
> > Starting with SO_BUSY_POLL, then switching to the per-epoll variant (except > > we went with a hard-coded napi_id argument instead of tracking) and trying to > > get a workable set of budget/timeout/gro_flush. We were fine with burning all > > cpu capacity we had and no sleep at all, so we ended up having a bunch > > of special cases in epoll loop to avoid the sleep. > > > > But we were trying to make a different model work (the one you mention in the > > paper as well) where the userspace busy-pollers are just running napi_poll > > on one cpu and the actual work is consumed by the userspace on a different cpu. > > (we had two epoll fds - one with napi_id=xxx and no sockets to drive napi_poll > > and another epoll fd with actual sockets for signaling). > > > > This mode has a different set of challenges with socket lock, socket rx > > queue and the backlog processing :-( > > I agree. That model has challenges and is extremely difficult to tune right. We noticed a similar issue when we were using the same thread to do application work and also napi busy polling on it. Large gro_flush_timeout would improve throughput under load but reduce latency in low load and vice versa. We were actually thinking of having gro_flush_timeout value change based on whether napi is in busy_poll state and resetting it to default value for softirq driven mode when busy_polling is stopped. Since we didn't have cpu constraints, we had dedicated pollers polling napi and separate threads polling for epoll events and doing application work. > > >>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > >>>> an individual queue or application to make sure that IRQ suspension is > >>>> enabled/disabled right away when the state of the system changes from busy > >>>> to idle and back. > >>> > >>> Can we not handle everything in napi_busy_loop? If we can mark some napi > >>> contexts as "explicitly polled by userspace with a larger defer timeout", > >>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > >>> which is more like "this particular napi_poll call is user busy polling". > >> > >> Then either the application needs to be polling all the time (wasting cpu > >> cycles) or latencies will be determined by the timeout. But if I understand correctly, this means that if the application thread that is supposed to do napi busy polling gets busy doing work on the new data/events in userspace, napi polling will not be done until the suspend_timeout triggers? Do you dispatch work to a separate worker threads, in userspace, from the thread that is doing epoll_wait? > >> > >> Only when switching back and forth between polling and interrupts is it > >> possible to get low latencies across a large spectrum of offered loads > >> without burning cpu cycles at 100%. > > > > Ah, I see what you're saying, yes, you're right. In this case ignore my comment > > about ep_suspend_napi_irqs/napi_resume_irqs. > > Thanks for probing and double-checking everything! Feedback is important > for us to properly document our proposal. > > > Let's see how other people feel about per-dev irq_suspend_timeout. Properly > > disabling napi during busy polling is super useful, but it would still > > be nice to plumb irq_suspend_timeout via epoll context or have it set on > > a per-napi basis imho. I agree, this would allow each napi queue to tune itself based on heuristics. But I think doing it through epoll independent interface makes more sense as Stan suggested earlier. > > Fingers crossed. 
I hope this patch will be accepted, because it has > practical performance and efficiency benefits, and that this will > further increase the motivation to re-design the entire irq > defer(/suspend) infrastructure for per-napi settings. > > Thanks, > Martin > >
On 2024-08-14 15:53, Samiullah Khawaja wrote: > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >> >> On 2024-08-13 00:07, Stanislav Fomichev wrote: >>> On 08/12, Martin Karsten wrote: >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: >>>>> On 08/12, Martin Karsten wrote: >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: >>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>>>>>>>> On 08/12, Joe Damato wrote: >>>>>>>>>> Greetings: [snip] >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of >>>>>> an individual queue or application to make sure that IRQ suspension is >>>>>> enabled/disabled right away when the state of the system changes from busy >>>>>> to idle and back. >>>>> >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi >>>>> contexts as "explicitly polled by userspace with a larger defer timeout", >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL >>>>> which is more like "this particular napi_poll call is user busy polling". >>>> >>>> Then either the application needs to be polling all the time (wasting cpu >>>> cycles) or latencies will be determined by the timeout. > But if I understand correctly, this means that if the application > thread that is supposed > to do napi busy polling gets busy doing work on the new data/events in > userspace, napi polling > will not be done until the suspend_timeout triggers? Do you dispatch > work to a separate worker > threads, in userspace, from the thread that is doing epoll_wait? Yes, napi polling is suspended while the application is busy between epoll_wait calls. That's where the benefits are coming from. The consequences depend on the nature of the application and overall preferences for the system. If there's a "dominant" application for a number of queues and cores, the resulting latency for other background applications using the same queues might not be a problem at all. One other simple mitigation is limiting the number of events that each epoll_wait call accepts. Note that this batch size also determines the worst-case latency for the application in question, so there is a natural incentive to keep it limited. A more complex application design, like you suggest, might also be an option. >>>> Only when switching back and forth between polling and interrupts is it >>>> possible to get low latencies across a large spectrum of offered loads >>>> without burning cpu cycles at 100%. >>> >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment >>> about ep_suspend_napi_irqs/napi_resume_irqs. >> >> Thanks for probing and double-checking everything! Feedback is important >> for us to properly document our proposal. >> >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly >>> disabling napi during busy polling is super useful, but it would still >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on >>> a per-napi basis imho. > I agree, this would allow each napi queue to tune itself based on > heuristics. But I think > doing it through epoll independent interface makes more sense as Stan > suggested earlier. The question is whether to add a useful mechanism (one sysfs parameter and a few lines of code) that is optional, but with demonstrable and significant performance/efficiency improvements for an important class of applications - or wait for an uncertain future? 
Note that adding our mechanism in no way precludes switching the control parameters from per-device to per-napi as Joe alluded to earlier. In fact, it increases the incentive for doing so. After working on this for quite a while, I am skeptical that anything fundamentally different could be done without re-architecting the entire napi control flow. Thanks, Martin
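To put rough numbers on the batch-size mitigation Martin mentions: if each request costs about t_req of userspace processing and epoll_wait is called with maxevents = B, the longest stretch during which the busy application keeps NAPI polling suspended is roughly B * t_req. With assumed values B = 8 and t_req = 10 us, that is on the order of 80 us, far below a 20 ms irq_suspend_timeout safety net; raising B trades a larger worst-case delay for other users of the same queue(s) against fewer epoll_wait calls.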
Martin Karsten wrote: > On 2024-08-14 15:53, Samiullah Khawaja wrote: > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > >> > >> On 2024-08-13 00:07, Stanislav Fomichev wrote: > >>> On 08/12, Martin Karsten wrote: > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: > >>>>> On 08/12, Martin Karsten wrote: > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > >>>>>>> On 08/12, Martin Karsten wrote: > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > >>>>>>>>> On 08/12, Joe Damato wrote: > >>>>>>>>>> Greetings: > > [snip] > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > >>>>>> an individual queue or application to make sure that IRQ suspension is > >>>>>> enabled/disabled right away when the state of the system changes from busy > >>>>>> to idle and back. > >>>>> > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout", > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > >>>>> which is more like "this particular napi_poll call is user busy polling". > >>>> > >>>> Then either the application needs to be polling all the time (wasting cpu > >>>> cycles) or latencies will be determined by the timeout. > > But if I understand correctly, this means that if the application > > thread that is supposed > > to do napi busy polling gets busy doing work on the new data/events in > > userspace, napi polling > > will not be done until the suspend_timeout triggers? Do you dispatch > > work to a separate worker > > threads, in userspace, from the thread that is doing epoll_wait? > > Yes, napi polling is suspended while the application is busy between > epoll_wait calls. That's where the benefits are coming from. > > The consequences depend on the nature of the application and overall > preferences for the system. If there's a "dominant" application for a > number of queues and cores, the resulting latency for other background > applications using the same queues might not be a problem at all. > > One other simple mitigation is limiting the number of events that each > epoll_wait call accepts. Note that this batch size also determines the > worst-case latency for the application in question, so there is a > natural incentive to keep it limited. > > A more complex application design, like you suggest, might also be an > option. > > >>>> Only when switching back and forth between polling and interrupts is it > >>>> possible to get low latencies across a large spectrum of offered loads > >>>> without burning cpu cycles at 100%. > >>> > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment > >>> about ep_suspend_napi_irqs/napi_resume_irqs. > >> > >> Thanks for probing and double-checking everything! Feedback is important > >> for us to properly document our proposal. > >> > >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly > >>> disabling napi during busy polling is super useful, but it would still > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on > >>> a per-napi basis imho. > > I agree, this would allow each napi queue to tune itself based on > > heuristics. But I think > > doing it through epoll independent interface makes more sense as Stan > > suggested earlier. 
> > The question is whether to add a useful mechanism (one sysfs parameter > and a few lines of code) that is optional, but with demonstrable and > significant performance/efficiency improvements for an important class > of applications - or wait for an uncertain future? The issue is that this one little change can never be removed, as it becomes ABI. Let's get the right API from the start. Not sure that a global variable, or sysfs as API, is the right one. > Note that adding our mechanism in no way precludes switching the control > parameters from per-device to per-napi as Joe alluded to earlier. In > fact, it increases the incentive for doing so. > > After working on this for quite a while, I am skeptical that anything > fundamentally different could be done without re-architecting the entire > napi control flow. > > Thanks, > Martin >
Willem de Bruijn wrote: > Martin Karsten wrote: > > On 2024-08-14 15:53, Samiullah Khawaja wrote: > > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > >> > > >> On 2024-08-13 00:07, Stanislav Fomichev wrote: > > >>> On 08/12, Martin Karsten wrote: > > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: > > >>>>> On 08/12, Martin Karsten wrote: > > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > > >>>>>>> On 08/12, Martin Karsten wrote: > > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > > >>>>>>>>> On 08/12, Joe Damato wrote: > > >>>>>>>>>> Greetings: > > > > [snip] > > > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > > >>>>>> an individual queue or application to make sure that IRQ suspension is > > >>>>>> enabled/disabled right away when the state of the system changes from busy > > >>>>>> to idle and back. > > >>>>> > > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi > > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout", > > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > > >>>>> which is more like "this particular napi_poll call is user busy polling". > > >>>> > > >>>> Then either the application needs to be polling all the time (wasting cpu > > >>>> cycles) or latencies will be determined by the timeout. > > > But if I understand correctly, this means that if the application > > > thread that is supposed > > > to do napi busy polling gets busy doing work on the new data/events in > > > userspace, napi polling > > > will not be done until the suspend_timeout triggers? Do you dispatch > > > work to a separate worker > > > threads, in userspace, from the thread that is doing epoll_wait? > > > > Yes, napi polling is suspended while the application is busy between > > epoll_wait calls. That's where the benefits are coming from. > > > > The consequences depend on the nature of the application and overall > > preferences for the system. If there's a "dominant" application for a > > number of queues and cores, the resulting latency for other background > > applications using the same queues might not be a problem at all. > > > > One other simple mitigation is limiting the number of events that each > > epoll_wait call accepts. Note that this batch size also determines the > > worst-case latency for the application in question, so there is a > > natural incentive to keep it limited. > > > > A more complex application design, like you suggest, might also be an > > option. > > > > >>>> Only when switching back and forth between polling and interrupts is it > > >>>> possible to get low latencies across a large spectrum of offered loads > > >>>> without burning cpu cycles at 100%. > > >>> > > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment > > >>> about ep_suspend_napi_irqs/napi_resume_irqs. > > >> > > >> Thanks for probing and double-checking everything! Feedback is important > > >> for us to properly document our proposal. > > >> > > >>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly > > >>> disabling napi during busy polling is super useful, but it would still > > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on > > >>> a per-napi basis imho. > > > I agree, this would allow each napi queue to tune itself based on > > > heuristics. 
But I think > > > doing it through epoll independent interface makes more sense as Stan > > > suggested earlier. > > > > The question is whether to add a useful mechanism (one sysfs parameter > > and a few lines of code) that is optional, but with demonstrable and > > significant performance/efficiency improvements for an important class > > of applications - or wait for an uncertain future? > > The issue is that this one little change can never be removed, as it > becomes ABI. > > Let's get the right API from the start. > > Not sure that a global variable, or sysfs as API, is the right one. Sorry per-device, not global. My main concern is that it adds yet another user tunable integer, for which the right value is not obvious. If the only goal is to safely reenable interrupts when the application stops calling epoll_wait, does this have to be user tunable? Can it be either a single good enough constant, or derived from another tunable, like busypoll_read.
On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: > Willem de Bruijn wrote: > > Martin Karsten wrote: > > > On 2024-08-14 15:53, Samiullah Khawaja wrote: > > > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > > >> > > > >> On 2024-08-13 00:07, Stanislav Fomichev wrote: > > > >>> On 08/12, Martin Karsten wrote: > > > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: > > > >>>>> On 08/12, Martin Karsten wrote: > > > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > > > >>>>>>> On 08/12, Martin Karsten wrote: > > > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > > > >>>>>>>>> On 08/12, Joe Damato wrote: > > > >>>>>>>>>> Greetings: > > > > > > [snip] > > > > > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > > > >>>>>> an individual queue or application to make sure that IRQ suspension is > > > >>>>>> enabled/disabled right away when the state of the system changes from busy > > > >>>>>> to idle and back. > > > >>>>> > > > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi > > > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout", > > > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > > > >>>>> which is more like "this particular napi_poll call is user busy polling". > > > >>>> > > > >>>> Then either the application needs to be polling all the time (wasting cpu > > > >>>> cycles) or latencies will be determined by the timeout. > > > > But if I understand correctly, this means that if the application > > > > thread that is supposed > > > > to do napi busy polling gets busy doing work on the new data/events in > > > > userspace, napi polling > > > > will not be done until the suspend_timeout triggers? Do you dispatch > > > > work to a separate worker > > > > threads, in userspace, from the thread that is doing epoll_wait? > > > > > > Yes, napi polling is suspended while the application is busy between > > > epoll_wait calls. That's where the benefits are coming from. > > > > > > The consequences depend on the nature of the application and overall > > > preferences for the system. If there's a "dominant" application for a > > > number of queues and cores, the resulting latency for other background > > > applications using the same queues might not be a problem at all. > > > > > > One other simple mitigation is limiting the number of events that each > > > epoll_wait call accepts. Note that this batch size also determines the > > > worst-case latency for the application in question, so there is a > > > natural incentive to keep it limited. > > > > > > A more complex application design, like you suggest, might also be an > > > option. > > > > > > >>>> Only when switching back and forth between polling and interrupts is it > > > >>>> possible to get low latencies across a large spectrum of offered loads > > > >>>> without burning cpu cycles at 100%. > > > >>> > > > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment > > > >>> about ep_suspend_napi_irqs/napi_resume_irqs. > > > >> > > > >> Thanks for probing and double-checking everything! Feedback is important > > > >> for us to properly document our proposal. > > > >> > > > >>> Let's see how other people feel about per-dev irq_suspend_timeout. 
Properly > > > >>> disabling napi during busy polling is super useful, but it would still > > > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on > > > >>> a per-napi basis imho. > > > > I agree, this would allow each napi queue to tune itself based on > > > > heuristics. But I think > > > > doing it through epoll independent interface makes more sense as Stan > > > > suggested earlier. > > > > > > The question is whether to add a useful mechanism (one sysfs parameter > > > and a few lines of code) that is optional, but with demonstrable and > > > significant performance/efficiency improvements for an important class > > > of applications - or wait for an uncertain future? > > > > The issue is that this one little change can never be removed, as it > > becomes ABI. > > > > Let's get the right API from the start. > > > > Not sure that a global variable, or sysfs as API, is the right one. > > Sorry per-device, not global. > > My main concern is that it adds yet another user tunable integer, for > which the right value is not obvious. This is a feature for advanced users just like SO_INCOMING_NAPI_ID and countless other features. The value may not be obvious, but guidance (in the form of documentation) can be provided. > If the only goal is to safely reenable interrupts when the application > stops calling epoll_wait, does this have to be user tunable? > > Can it be either a single good enough constant, or derived from > another tunable, like busypoll_read. I believe you meant busy_read here, is that right? At any rate: - I don't think a single constant is appropriate, just as it wasn't appropriate for the existing mechanism (napi_defer_hard_irqs/gro_flush_timeout), and - Deriving the value from a pre-existing parameter to preserve the ABI, like busy_read, makes using this more confusing for users and complicates the API significantly. I agree we should get the API right from the start; that's why we've submit this as an RFC ;) We are happy to take suggestions from the community, but, IMHO, re-using an existing parameter for a different purpose only in certain circumstances (if I understand your suggestions) is a much worse choice than adding a new tunable that clearly states its intended singular purpose. - Joe
Joe Damato wrote: > On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: > > Willem de Bruijn wrote: > > > Martin Karsten wrote: > > > > On 2024-08-14 15:53, Samiullah Khawaja wrote: > > > > > On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > > > > >> > > > > >> On 2024-08-13 00:07, Stanislav Fomichev wrote: > > > > >>> On 08/12, Martin Karsten wrote: > > > > >>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: > > > > >>>>> On 08/12, Martin Karsten wrote: > > > > >>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > > > > >>>>>>> On 08/12, Martin Karsten wrote: > > > > >>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > > > > >>>>>>>>> On 08/12, Joe Damato wrote: > > > > >>>>>>>>>> Greetings: > > > > > > > > [snip] > > > > > > > > >>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > > > > >>>>>> an individual queue or application to make sure that IRQ suspension is > > > > >>>>>> enabled/disabled right away when the state of the system changes from busy > > > > >>>>>> to idle and back. > > > > >>>>> > > > > >>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi > > > > >>>>> contexts as "explicitly polled by userspace with a larger defer timeout", > > > > >>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > > > > >>>>> which is more like "this particular napi_poll call is user busy polling". > > > > >>>> > > > > >>>> Then either the application needs to be polling all the time (wasting cpu > > > > >>>> cycles) or latencies will be determined by the timeout. > > > > > But if I understand correctly, this means that if the application > > > > > thread that is supposed > > > > > to do napi busy polling gets busy doing work on the new data/events in > > > > > userspace, napi polling > > > > > will not be done until the suspend_timeout triggers? Do you dispatch > > > > > work to a separate worker > > > > > threads, in userspace, from the thread that is doing epoll_wait? > > > > > > > > Yes, napi polling is suspended while the application is busy between > > > > epoll_wait calls. That's where the benefits are coming from. > > > > > > > > The consequences depend on the nature of the application and overall > > > > preferences for the system. If there's a "dominant" application for a > > > > number of queues and cores, the resulting latency for other background > > > > applications using the same queues might not be a problem at all. > > > > > > > > One other simple mitigation is limiting the number of events that each > > > > epoll_wait call accepts. Note that this batch size also determines the > > > > worst-case latency for the application in question, so there is a > > > > natural incentive to keep it limited. > > > > > > > > A more complex application design, like you suggest, might also be an > > > > option. > > > > > > > > >>>> Only when switching back and forth between polling and interrupts is it > > > > >>>> possible to get low latencies across a large spectrum of offered loads > > > > >>>> without burning cpu cycles at 100%. > > > > >>> > > > > >>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment > > > > >>> about ep_suspend_napi_irqs/napi_resume_irqs. > > > > >> > > > > >> Thanks for probing and double-checking everything! Feedback is important > > > > >> for us to properly document our proposal. > > > > >> > > > > >>> Let's see how other people feel about per-dev irq_suspend_timeout. 
Properly > > > > >>> disabling napi during busy polling is super useful, but it would still > > > > >>> be nice to plumb irq_suspend_timeout via epoll context or have it set on > > > > >>> a per-napi basis imho. > > > > > I agree, this would allow each napi queue to tune itself based on > > > > > heuristics. But I think > > > > > doing it through epoll independent interface makes more sense as Stan > > > > > suggested earlier. > > > > > > > > The question is whether to add a useful mechanism (one sysfs parameter > > > > and a few lines of code) that is optional, but with demonstrable and > > > > significant performance/efficiency improvements for an important class > > > > of applications - or wait for an uncertain future? > > > > > > The issue is that this one little change can never be removed, as it > > > becomes ABI. > > > > > > Let's get the right API from the start. > > > > > > Not sure that a global variable, or sysfs as API, is the right one. > > > > Sorry per-device, not global. > > > > My main concern is that it adds yet another user tunable integer, for > > which the right value is not obvious. > > This is a feature for advanced users just like SO_INCOMING_NAPI_ID > and countless other features. > > The value may not be obvious, but guidance (in the form of > documentation) can be provided. Okay. Could you share a stab at what that would look like? > > If the only goal is to safely reenable interrupts when the application > > stops calling epoll_wait, does this have to be user tunable? > > > > Can it be either a single good enough constant, or derived from > > another tunable, like busypoll_read. > > I believe you meant busy_read here, is that right? > > At any rate: > > - I don't think a single constant is appropriate, just as it > wasn't appropriate for the existing mechanism > (napi_defer_hard_irqs/gro_flush_timeout), and > > - Deriving the value from a pre-existing parameter to preserve the > ABI, like busy_read, makes using this more confusing for users > and complicates the API significantly. > > I agree we should get the API right from the start; that's why we've > submit this as an RFC ;) > > We are happy to take suggestions from the community, but, IMHO, > re-using an existing parameter for a different purpose only in > certain circumstances (if I understand your suggestions) is a much > worse choice than adding a new tunable that clearly states its > intended singular purpose. Ack. I was thinking whether an epoll flag through your new epoll ioctl interface to toggle the IRQ suspension (and timer start) would be preferable. Because more fine grained. Also, the value is likely dependent more on the expected duration of userspace processing? If so, it would be the same for all devices, so does a per-netdev value make sense?
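For concreteness, the existing per-epoll-context interface referred to here is the recently added EPIOCSPARAMS epoll ioctl. A minimal sketch of enabling prefer-busy-poll on an epoll instance is below; the fallback struct layout and ioctl number mirror recent uapi <linux/eventpoll.h> and should be verified against the kernel in use, and any per-context irq_suspend_timeout field debated in this thread is hypothetical, not part of the current interface.

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

#ifndef EPIOCSPARAMS
/* Fallback for userspace headers that predate the epoll ioctl; layout and
 * ioctl number follow recent uapi <linux/eventpoll.h> -- verify locally.
 */
struct epoll_params {
        unsigned int   busy_poll_usecs;
        unsigned short busy_poll_budget;
        unsigned char  prefer_busy_poll;
        unsigned char  __pad;           /* must be zero */
};
#define EPIOCSPARAMS _IOW(0x8A, 0x01, struct epoll_params)
#endif

int main(void)
{
        struct epoll_params params = {
                .busy_poll_usecs  = 64, /* illustrative busy poll duration */
                .busy_poll_budget = 64, /* illustrative packet budget      */
                .prefer_busy_poll = 1,  /* opt in to preferred busy poll   */
        };
        int epfd = epoll_create1(0);

        if (epfd < 0) {
                perror("epoll_create1");
                return 1;
        }
        if (ioctl(epfd, EPIOCSPARAMS, &params) < 0) {
                perror("EPIOCSPARAMS");
                close(epfd);
                return 1;
        }
        close(epfd);
        return 0;
}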
On 2024-08-16 13:01, Willem de Bruijn wrote: > Joe Damato wrote: >> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: >>> Willem de Bruijn wrote: >>>> Martin Karsten wrote: >>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote: >>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >>>>>>> >>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote: >>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: >>>>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: >>>>>>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>>>>>>>>>>>>> On 08/12, Joe Damato wrote: >>>>>>>>>>>>>>> Greetings: >>>>> >>>>> [snip] >>>>> >>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of >>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is >>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy >>>>>>>>>>> to idle and back. >>>>>>>>>> >>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi >>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout", >>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL >>>>>>>>>> which is more like "this particular napi_poll call is user busy polling". >>>>>>>>> >>>>>>>>> Then either the application needs to be polling all the time (wasting cpu >>>>>>>>> cycles) or latencies will be determined by the timeout. >>>>>> But if I understand correctly, this means that if the application >>>>>> thread that is supposed >>>>>> to do napi busy polling gets busy doing work on the new data/events in >>>>>> userspace, napi polling >>>>>> will not be done until the suspend_timeout triggers? Do you dispatch >>>>>> work to a separate worker >>>>>> threads, in userspace, from the thread that is doing epoll_wait? >>>>> >>>>> Yes, napi polling is suspended while the application is busy between >>>>> epoll_wait calls. That's where the benefits are coming from. >>>>> >>>>> The consequences depend on the nature of the application and overall >>>>> preferences for the system. If there's a "dominant" application for a >>>>> number of queues and cores, the resulting latency for other background >>>>> applications using the same queues might not be a problem at all. >>>>> >>>>> One other simple mitigation is limiting the number of events that each >>>>> epoll_wait call accepts. Note that this batch size also determines the >>>>> worst-case latency for the application in question, so there is a >>>>> natural incentive to keep it limited. >>>>> >>>>> A more complex application design, like you suggest, might also be an >>>>> option. >>>>> >>>>>>>>> Only when switching back and forth between polling and interrupts is it >>>>>>>>> possible to get low latencies across a large spectrum of offered loads >>>>>>>>> without burning cpu cycles at 100%. >>>>>>>> >>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment >>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs. >>>>>>> >>>>>>> Thanks for probing and double-checking everything! Feedback is important >>>>>>> for us to properly document our proposal. >>>>>>> >>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. 
Properly >>>>>>>> disabling napi during busy polling is super useful, but it would still >>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on >>>>>>>> a per-napi basis imho. >>>>>> I agree, this would allow each napi queue to tune itself based on >>>>>> heuristics. But I think >>>>>> doing it through epoll independent interface makes more sense as Stan >>>>>> suggested earlier. >>>>> >>>>> The question is whether to add a useful mechanism (one sysfs parameter >>>>> and a few lines of code) that is optional, but with demonstrable and >>>>> significant performance/efficiency improvements for an important class >>>>> of applications - or wait for an uncertain future? >>>> >>>> The issue is that this one little change can never be removed, as it >>>> becomes ABI. >>>> >>>> Let's get the right API from the start. >>>> >>>> Not sure that a global variable, or sysfs as API, is the right one. >>> >>> Sorry per-device, not global. >>> >>> My main concern is that it adds yet another user tunable integer, for >>> which the right value is not obvious. >> >> This is a feature for advanced users just like SO_INCOMING_NAPI_ID >> and countless other features. >> >> The value may not be obvious, but guidance (in the form of >> documentation) can be provided. > > Okay. Could you share a stab at what that would look like? The timeout needs to be large enough that an application can get a meaningful number of incoming requests processed without softirq interference. At the same time, the timeout value determines the worst-case delivery delay that a concurrent application using the same queue(s) might experience. Please also see my response to Samiullah quoted above. The specific circumstances and trade-offs might vary, that's why a simple constant likely won't do. >>> If the only goal is to safely reenable interrupts when the application >>> stops calling epoll_wait, does this have to be user tunable? >>> >>> Can it be either a single good enough constant, or derived from >>> another tunable, like busypoll_read. >> >> I believe you meant busy_read here, is that right? >> >> At any rate: >> >> - I don't think a single constant is appropriate, just as it >> wasn't appropriate for the existing mechanism >> (napi_defer_hard_irqs/gro_flush_timeout), and >> >> - Deriving the value from a pre-existing parameter to preserve the >> ABI, like busy_read, makes using this more confusing for users >> and complicates the API significantly. >> >> I agree we should get the API right from the start; that's why we've >> submit this as an RFC ;) >> >> We are happy to take suggestions from the community, but, IMHO, >> re-using an existing parameter for a different purpose only in >> certain circumstances (if I understand your suggestions) is a much >> worse choice than adding a new tunable that clearly states its >> intended singular purpose. > > Ack. I was thinking whether an epoll flag through your new epoll > ioctl interface to toggle the IRQ suspension (and timer start) > would be preferable. Because more fine grained. A value provided by an application through the epoll ioctl would not be subject to admin oversight, so a misbehaving application could set an arbitrary timeout value. A sysfs value needs to be set by an admin. The ideal timeout value depends both on the particular target application as well as concurrent applications using the same queue(s) - as sketched above. > Also, the value is likely dependent more on the expected duration > of userspace processing? 
If so, it would be the same for all > devices, so does a per-netdev value make sense? It is per-netdev in the current proposal to be at the same granularity as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension operates at the same level/granularity. This allows for more control than a global setting and it can be migrated to per-napi settings along with gro_flush_timeout and napi_defer_hard_irqs when the time comes. Thanks, Martin
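To make the admin-side setup concrete, here is a small sketch of writing the two existing per-netdev knobs plus the proposed one. The /sys/class/net/<dev>/irq_suspend_timeout path is an assumption that the new knob would sit next to the existing two, as the per-netdev placement described above implies; the device name and all values (in nanoseconds) are examples only.

#include <stdio.h>

static int write_sysfs(const char *path, unsigned long long val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%llu\n", val);
	return fclose(f);
}

int main(void)
{
	const char *dev = "eth0";	/* hypothetical interface name */
	char path[256];

	/* short deferral window used while the application is idle */
	snprintf(path, sizeof(path), "/sys/class/net/%s/gro_flush_timeout", dev);
	write_sysfs(path, 20000);		/* 20 us, example value */

	snprintf(path, sizeof(path), "/sys/class/net/%s/napi_defer_hard_irqs", dev);
	write_sysfs(path, 2);			/* example value */

	/* ceiling on IRQ suspension while epoll_wait keeps finding events;
	 * must be larger than gro_flush_timeout (assumed path, proposed by
	 * this series) */
	snprintf(path, sizeof(path), "/sys/class/net/%s/irq_suspend_timeout", dev);
	write_sysfs(path, 20000000);		/* 20 ms, example value */
	return 0;
}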
Martin Karsten wrote: > On 2024-08-16 13:01, Willem de Bruijn wrote: > > Joe Damato wrote: > >> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: > >>> Willem de Bruijn wrote: > >>>> Martin Karsten wrote: > >>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote: > >>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: > >>>>>>> > >>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote: > >>>>>>>> On 08/12, Martin Karsten wrote: > >>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: > >>>>>>>>>> On 08/12, Martin Karsten wrote: > >>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: > >>>>>>>>>>>> On 08/12, Martin Karsten wrote: > >>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: > >>>>>>>>>>>>>> On 08/12, Joe Damato wrote: > >>>>>>>>>>>>>>> Greetings: > >>>>> > >>>>> [snip] > >>>>> > >>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of > >>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is > >>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy > >>>>>>>>>>> to idle and back. > >>>>>>>>>> > >>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi > >>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout", > >>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL > >>>>>>>>>> which is more like "this particular napi_poll call is user busy polling". > >>>>>>>>> > >>>>>>>>> Then either the application needs to be polling all the time (wasting cpu > >>>>>>>>> cycles) or latencies will be determined by the timeout. > >>>>>> But if I understand correctly, this means that if the application > >>>>>> thread that is supposed > >>>>>> to do napi busy polling gets busy doing work on the new data/events in > >>>>>> userspace, napi polling > >>>>>> will not be done until the suspend_timeout triggers? Do you dispatch > >>>>>> work to a separate worker > >>>>>> threads, in userspace, from the thread that is doing epoll_wait? > >>>>> > >>>>> Yes, napi polling is suspended while the application is busy between > >>>>> epoll_wait calls. That's where the benefits are coming from. > >>>>> > >>>>> The consequences depend on the nature of the application and overall > >>>>> preferences for the system. If there's a "dominant" application for a > >>>>> number of queues and cores, the resulting latency for other background > >>>>> applications using the same queues might not be a problem at all. > >>>>> > >>>>> One other simple mitigation is limiting the number of events that each > >>>>> epoll_wait call accepts. Note that this batch size also determines the > >>>>> worst-case latency for the application in question, so there is a > >>>>> natural incentive to keep it limited. > >>>>> > >>>>> A more complex application design, like you suggest, might also be an > >>>>> option. > >>>>> > >>>>>>>>> Only when switching back and forth between polling and interrupts is it > >>>>>>>>> possible to get low latencies across a large spectrum of offered loads > >>>>>>>>> without burning cpu cycles at 100%. > >>>>>>>> > >>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment > >>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs. > >>>>>>> > >>>>>>> Thanks for probing and double-checking everything! Feedback is important > >>>>>>> for us to properly document our proposal. > >>>>>>> > >>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. 
Properly > >>>>>>>> disabling napi during busy polling is super useful, but it would still > >>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on > >>>>>>>> a per-napi basis imho. > >>>>>> I agree, this would allow each napi queue to tune itself based on > >>>>>> heuristics. But I think > >>>>>> doing it through epoll independent interface makes more sense as Stan > >>>>>> suggested earlier. > >>>>> > >>>>> The question is whether to add a useful mechanism (one sysfs parameter > >>>>> and a few lines of code) that is optional, but with demonstrable and > >>>>> significant performance/efficiency improvements for an important class > >>>>> of applications - or wait for an uncertain future? > >>>> > >>>> The issue is that this one little change can never be removed, as it > >>>> becomes ABI. > >>>> > >>>> Let's get the right API from the start. > >>>> > >>>> Not sure that a global variable, or sysfs as API, is the right one. > >>> > >>> Sorry per-device, not global. > >>> > >>> My main concern is that it adds yet another user tunable integer, for > >>> which the right value is not obvious. > >> > >> This is a feature for advanced users just like SO_INCOMING_NAPI_ID > >> and countless other features. > >> > >> The value may not be obvious, but guidance (in the form of > >> documentation) can be provided. > > > > Okay. Could you share a stab at what that would look like? > > The timeout needs to be large enough that an application can get a > meaningful number of incoming requests processed without softirq > interference. At the same time, the timeout value determines the > worst-case delivery delay that a concurrent application using the same > queue(s) might experience. Please also see my response to Samiullah > quoted above. The specific circumstances and trade-offs might vary, > that's why a simple constant likely won't do. Thanks. I really do mean this as an exercise of what documentation in Documentation/networking/napi.rst will look like. That helps makes the case that the interface is reasonably ease to use (even if only targeting advanced users). How does a user measure how much time a process will spend on processing a meaningful number of incoming requests, for instance. In practice, probably just a hunch? Playing devil's advocate some more: given that ethtool usecs have to be chosen with a similar trade-off between latency and efficiency, could a multiplicative factor of this (or gro_flush_timeout, same thing) be sufficient and easier to choose? The documentation does state that the value chosen must be >= gro_flush_timeout. > >>> If the only goal is to safely reenable interrupts when the application > >>> stops calling epoll_wait, does this have to be user tunable? > >>> > >>> Can it be either a single good enough constant, or derived from > >>> another tunable, like busypoll_read. > >> > >> I believe you meant busy_read here, is that right? > >> > >> At any rate: > >> > >> - I don't think a single constant is appropriate, just as it > >> wasn't appropriate for the existing mechanism > >> (napi_defer_hard_irqs/gro_flush_timeout), and > >> > >> - Deriving the value from a pre-existing parameter to preserve the > >> ABI, like busy_read, makes using this more confusing for users > >> and complicates the API significantly. 
> >> > >> I agree we should get the API right from the start; that's why we've > >> submit this as an RFC ;) > >> > >> We are happy to take suggestions from the community, but, IMHO, > >> re-using an existing parameter for a different purpose only in > >> certain circumstances (if I understand your suggestions) is a much > >> worse choice than adding a new tunable that clearly states its > >> intended singular purpose. > > > > Ack. I was thinking whether an epoll flag through your new epoll > > ioctl interface to toggle the IRQ suspension (and timer start) > > would be preferable. Because more fine grained. > > A value provided by an application through the epoll ioctl would not be > subject to admin oversight, so a misbehaving application could set an > arbitrary timeout value. A sysfs value needs to be set by an admin. The > ideal timeout value depends both on the particular target application as > well as concurrent applications using the same queue(s) - as sketched above. I meant setting the value systemwide (or per-device), but opting in to the feature via a binary epoll option. Really an epoll_wait flag, if we had flags. Any admin-privileged operations can also be protected at the epoll level by requiring CAP_NET_ADMIN too, of course. But fair point that this might operate in a multi-process environment, so values should not be hardcoded into the binaries. Just asking questions to explore the option space so as not to settle on an API too soon. Given that, as said, we cannot remove it later. > > Also, the value is likely dependent more on the expected duration > > of userspace processing? If so, it would be the same for all > > devices, so does a per-netdev value make sense? > > It is per-netdev in the current proposal to be at the same granularity > as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension > operates at the same level/granularity. This allows for more control > than a global setting and it can be migrated to per-napi settings along > with gro_flush_timeout and napi_defer_hard_irqs when the time comes. Ack, makes sense. Many of these design choices and their rationale are good to explicitly capture in the commit message.
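Purely as an illustration of the alternative being floated here (a binary opt-in through the epoll interface, with the timeout value itself remaining an admin-controlled per-device setting), a hypothetical layout could look like the following. Nothing in this sketch exists in this series or in any kernel release; it only maps out the option space.

#include <stdint.h>

/* Hypothetical extension of the epoll parameters, NOT part of this series or
 * of the kernel. The application only flips a boolean; the maximum suspension
 * time would stay under admin control via the per-netdev sysfs value. */
struct epoll_params_hypothetical {
	uint32_t busy_poll_usecs;
	uint16_t busy_poll_budget;
	uint8_t  prefer_busy_poll;
	uint8_t  irq_suspend;	/* hypothetical: 1 = allow IRQ suspension for
				 * NAPIs polled through this context, capped by
				 * the admin-set irq_suspend_timeout */
};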
On Fri, Aug 16, 2024 at 01:01:42PM -0400, Willem de Bruijn wrote: > Joe Damato wrote: > > On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: [...] > > > Willem de Bruijn wrote: > > > If the only goal is to safely reenable interrupts when the application > > > stops calling epoll_wait, does this have to be user tunable? > > > > > > Can it be either a single good enough constant, or derived from > > > another tunable, like busypoll_read. > > > > I believe you meant busy_read here, is that right? > > > > At any rate: > > > > - I don't think a single constant is appropriate, just as it > > wasn't appropriate for the existing mechanism > > (napi_defer_hard_irqs/gro_flush_timeout), and > > > > - Deriving the value from a pre-existing parameter to preserve the > > ABI, like busy_read, makes using this more confusing for users > > and complicates the API significantly. > > > > I agree we should get the API right from the start; that's why we've > > submit this as an RFC ;) > > > > We are happy to take suggestions from the community, but, IMHO, > > re-using an existing parameter for a different purpose only in > > certain circumstances (if I understand your suggestions) is a much > > worse choice than adding a new tunable that clearly states its > > intended singular purpose. > > Ack. I was thinking whether an epoll flag through your new epoll > ioctl interface to toggle the IRQ suspension (and timer start) > would be preferable. Because more fine grained. I understand why you are asking about this and I think it would be great if this were possible, but unfortunately it isn't. epoll contexts can be associated with any NAPI ID, but the IRQ suspension is NAPI specific. As an example: imagine a user program creates an epoll context and adds fds with NAPI ID 1111 to the context. It then issues the ioctl to set the suspend timeout for that context. Then, for whatever reason, the user app decides to remove all the fds and add new ones, this time from NAPI ID 2222, which happens to be a different net_device. What does that mean for the suspend timeout? It's not clear to me what the right behavior would be in this situation (does it persist? does it get cleared when a new NAPI ID is added? etc) and it makes the user API much more complicated, with many more edge cases and possible bugs. > Also, the value is likely dependent more on the expected duration > of userspace processing? If so, it would be the same for all > devices, so does a per-netdev value make sense? There is presently no way to set values like gro_timeout, defer_hard_irqs, or this new proposed value on a per-NAPI basis. IMHO, that is really where all of these values should live. I mentioned on the list previously (and also in the cover letter), that time permitting, I think the correct evolution of this would be to support per-NAPI settings (via netdev-genl, I assume) so that user programs can set all 3 values on only the NAPIs they care about. Until that functionality is available, it would seem per-netdev is the only way for this feature to be added at the present time. I simply haven't had the time to add the above interface. This feature we're proposing has demonstrable performance value, but it doesn't seem sensible to block it until time permits me to add a per-NAPI interface for all of these values given that we already globally expose 2 of them. That said, I appreciate the thoughtfulness of your replies and I am open to other suggestions. - Joe
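As a practical aside to the NAPI ID churn described above: an application can observe which NAPI instance each fd currently maps to with the existing SO_INCOMING_NAPI_ID socket option. A small sketch follows; the surrounding code is purely illustrative.

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56	/* value from asm-generic/socket.h */
#endif

static void print_napi_id(int fd)
{
	unsigned int napi_id = 0;
	socklen_t len = sizeof(napi_id);

	if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len) == 0)
		printf("fd %d -> NAPI ID %u\n", fd, napi_id);
	else
		perror("SO_INCOMING_NAPI_ID");
}

int main(void)
{
	/* A NAPI ID is only assigned after packets for this socket have been
	 * handled by some NAPI instance, so a fresh socket reports 0. Two fds
	 * in the same epoll set may report different NAPI IDs, possibly on
	 * different net_devices, which is why the epoll context itself has no
	 * single NAPI (or netdev) identity to attach a timeout to. */
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	print_napi_id(fd);
	return 0;
}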
On 2024-08-16 16:58, Willem de Bruijn wrote: > Martin Karsten wrote: >> On 2024-08-16 13:01, Willem de Bruijn wrote: >>> Joe Damato wrote: >>>> On Fri, Aug 16, 2024 at 10:59:51AM -0400, Willem de Bruijn wrote: >>>>> Willem de Bruijn wrote: >>>>>> Martin Karsten wrote: >>>>>>> On 2024-08-14 15:53, Samiullah Khawaja wrote: >>>>>>>> On Tue, Aug 13, 2024 at 6:19 AM Martin Karsten <mkarsten@uwaterloo.ca> wrote: >>>>>>>>> >>>>>>>>> On 2024-08-13 00:07, Stanislav Fomichev wrote: >>>>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>>>> On 2024-08-12 21:54, Stanislav Fomichev wrote: >>>>>>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>>>>>> On 2024-08-12 19:03, Stanislav Fomichev wrote: >>>>>>>>>>>>>> On 08/12, Martin Karsten wrote: >>>>>>>>>>>>>>> On 2024-08-12 16:19, Stanislav Fomichev wrote: >>>>>>>>>>>>>>>> On 08/12, Joe Damato wrote: >>>>>>>>>>>>>>>>> Greetings: >>>>>>> >>>>>>> [snip] >>>>>>> >>>>>>>>>>>>> Note that napi_suspend_irqs/napi_resume_irqs is needed even for the sake of >>>>>>>>>>>>> an individual queue or application to make sure that IRQ suspension is >>>>>>>>>>>>> enabled/disabled right away when the state of the system changes from busy >>>>>>>>>>>>> to idle and back. >>>>>>>>>>>> >>>>>>>>>>>> Can we not handle everything in napi_busy_loop? If we can mark some napi >>>>>>>>>>>> contexts as "explicitly polled by userspace with a larger defer timeout", >>>>>>>>>>>> we should be able to do better compared to current NAPI_F_PREFER_BUSY_POLL >>>>>>>>>>>> which is more like "this particular napi_poll call is user busy polling". >>>>>>>>>>> >>>>>>>>>>> Then either the application needs to be polling all the time (wasting cpu >>>>>>>>>>> cycles) or latencies will be determined by the timeout. >>>>>>>> But if I understand correctly, this means that if the application >>>>>>>> thread that is supposed >>>>>>>> to do napi busy polling gets busy doing work on the new data/events in >>>>>>>> userspace, napi polling >>>>>>>> will not be done until the suspend_timeout triggers? Do you dispatch >>>>>>>> work to a separate worker >>>>>>>> threads, in userspace, from the thread that is doing epoll_wait? >>>>>>> >>>>>>> Yes, napi polling is suspended while the application is busy between >>>>>>> epoll_wait calls. That's where the benefits are coming from. >>>>>>> >>>>>>> The consequences depend on the nature of the application and overall >>>>>>> preferences for the system. If there's a "dominant" application for a >>>>>>> number of queues and cores, the resulting latency for other background >>>>>>> applications using the same queues might not be a problem at all. >>>>>>> >>>>>>> One other simple mitigation is limiting the number of events that each >>>>>>> epoll_wait call accepts. Note that this batch size also determines the >>>>>>> worst-case latency for the application in question, so there is a >>>>>>> natural incentive to keep it limited. >>>>>>> >>>>>>> A more complex application design, like you suggest, might also be an >>>>>>> option. >>>>>>> >>>>>>>>>>> Only when switching back and forth between polling and interrupts is it >>>>>>>>>>> possible to get low latencies across a large spectrum of offered loads >>>>>>>>>>> without burning cpu cycles at 100%. >>>>>>>>>> >>>>>>>>>> Ah, I see what you're saying, yes, you're right. In this case ignore my comment >>>>>>>>>> about ep_suspend_napi_irqs/napi_resume_irqs. >>>>>>>>> >>>>>>>>> Thanks for probing and double-checking everything! Feedback is important >>>>>>>>> for us to properly document our proposal. 
>>>>>>>>> >>>>>>>>>> Let's see how other people feel about per-dev irq_suspend_timeout. Properly >>>>>>>>>> disabling napi during busy polling is super useful, but it would still >>>>>>>>>> be nice to plumb irq_suspend_timeout via epoll context or have it set on >>>>>>>>>> a per-napi basis imho. >>>>>>>> I agree, this would allow each napi queue to tune itself based on >>>>>>>> heuristics. But I think >>>>>>>> doing it through epoll independent interface makes more sense as Stan >>>>>>>> suggested earlier. >>>>>>> >>>>>>> The question is whether to add a useful mechanism (one sysfs parameter >>>>>>> and a few lines of code) that is optional, but with demonstrable and >>>>>>> significant performance/efficiency improvements for an important class >>>>>>> of applications - or wait for an uncertain future? >>>>>> >>>>>> The issue is that this one little change can never be removed, as it >>>>>> becomes ABI. >>>>>> >>>>>> Let's get the right API from the start. >>>>>> >>>>>> Not sure that a global variable, or sysfs as API, is the right one. >>>>> >>>>> Sorry per-device, not global. >>>>> >>>>> My main concern is that it adds yet another user tunable integer, for >>>>> which the right value is not obvious. >>>> >>>> This is a feature for advanced users just like SO_INCOMING_NAPI_ID >>>> and countless other features. >>>> >>>> The value may not be obvious, but guidance (in the form of >>>> documentation) can be provided. >>> >>> Okay. Could you share a stab at what that would look like? >> >> The timeout needs to be large enough that an application can get a >> meaningful number of incoming requests processed without softirq >> interference. At the same time, the timeout value determines the >> worst-case delivery delay that a concurrent application using the same >> queue(s) might experience. Please also see my response to Samiullah >> quoted above. The specific circumstances and trade-offs might vary, >> that's why a simple constant likely won't do. > > Thanks. I really do mean this as an exercise of what documentation in > Documentation/networking/napi.rst will look like. That helps makes the > case that the interface is reasonably ease to use (even if only > targeting advanced users). > > How does a user measure how much time a process will spend on > processing a meaningful number of incoming requests, for instance. > In practice, probably just a hunch? As an example, we measure around 1M QPS in our experiments, fully utilizing 8 cores and knowing that memcached is quite scalable. Thus we can conclude a single request takes about 8 us processing time on average. That has led us to a 20 us small timeout (gro_flush_timeout), enough to make sure that a single request is likely not interfered with, but otherwise as small as possible. If multiple requests arrive, the system will quickly switch back to polling mode. At the other end, we have picked a very large irq_suspend_timeout of 20,000 us to demonstrate that it does not negatively impact latency. This would cover 2,500 requests, which is likely excessive, but was chosen for demonstration purposes. One can easily measure the distribution of epoll_wait batch sizes and batch sizes as low as 64 are already very efficient, even in high-load situations. Also see next paragraph. > Playing devil's advocate some more: given that ethtool usecs have to > be chosen with a similar trade-off between latency and efficiency, > could a multiplicative factor of this (or gro_flush_timeout, same > thing) be sufficient and easier to choose? 
The documentation does > state that the value chosen must be >= gro_flush_timeout. I believe this would take away flexibility without gaining much. You'd still want some sort of admin-controlled 'enable' flag, so you'd still need some kind of parameter. When using our scheme, the factor between gro_flush_timeout and irq_suspend_timeout should *roughly* correspond to the maximum batch size that an application would process in one go (orders of magnitude, see above). This determines both the target application's worst-case latency as well as the worst-case latency of concurrent applications, if any, as mentioned previously. I believe the optimal factor will vary between different scenarios. >>>>> If the only goal is to safely reenable interrupts when the application >>>>> stops calling epoll_wait, does this have to be user tunable? >>>>> >>>>> Can it be either a single good enough constant, or derived from >>>>> another tunable, like busypoll_read. >>>> >>>> I believe you meant busy_read here, is that right? >>>> >>>> At any rate: >>>> >>>> - I don't think a single constant is appropriate, just as it >>>> wasn't appropriate for the existing mechanism >>>> (napi_defer_hard_irqs/gro_flush_timeout), and >>>> >>>> - Deriving the value from a pre-existing parameter to preserve the >>>> ABI, like busy_read, makes using this more confusing for users >>>> and complicates the API significantly. >>>> >>>> I agree we should get the API right from the start; that's why we've >>>> submit this as an RFC ;) >>>> >>>> We are happy to take suggestions from the community, but, IMHO, >>>> re-using an existing parameter for a different purpose only in >>>> certain circumstances (if I understand your suggestions) is a much >>>> worse choice than adding a new tunable that clearly states its >>>> intended singular purpose. >>> >>> Ack. I was thinking whether an epoll flag through your new epoll >>> ioctl interface to toggle the IRQ suspension (and timer start) >>> would be preferable. Because more fine grained. >> >> A value provided by an application through the epoll ioctl would not be >> subject to admin oversight, so a misbehaving application could set an >> arbitrary timeout value. A sysfs value needs to be set by an admin. The >> ideal timeout value depends both on the particular target application as >> well as concurrent applications using the same queue(s) - as sketched above. > > I meant setting the value systemwide (or per-device), but opting in to > the feature a binary epoll options. Really an epoll_wait flag, if we > had flags. > > Any admin privileged operations can also be protected at the epoll > level by requiring CAP_NET_ADMIN too, of course. But fair point that > this might operate in a multi-process environment, so values should > not be hardcoded into the binaries. > > Just asking questions to explore the option space so as not to settle > on an API too soon. Given that, as said, we cannot remove it later. I agree, but I believe we are converging? Also taking into account Joe's earlier response, given that the suspend mechanism dovetails so nicely with gro_flush_timeout and napi_defer_hard_irqs, it just seems natural to put irq_suspend_timeout at the same level and I haven't seen any strong reason to put it elsewhere. >>> Also, the value is likely dependent more on the expected duration >>> of userspace processing? If so, it would be the same for all >>> devices, so does a per-netdev value make sense? 
>> >> It is per-netdev in the current proposal to be at the same granularity >> as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension >> operates at the same level/granularity. This allows for more control >> than a global setting and it can be migrated to per-napi settings along >> with gro_flush_timeout and napi_defer_hard_irqs when the time comes. > > Ack, makes sense. Many of these design choices and their rationale are > good to explicitly capture in the commit message. Agreed. Thanks, Martin
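For concreteness, a back-of-the-envelope version of that sizing heuristic using the memcached numbers from this thread. The 2.5x headroom on the short timeout and the batch size of 64 are assumptions made only for this example; the 20 ms value used in the experiments above is deliberately much larger than this calculation requires.

#include <stdio.h>

int main(void)
{
	double qps = 1e6;	/* measured throughput at saturation */
	double cores = 8.0;	/* cores fully utilized at that load */
	double svc_ns = cores / qps * 1e9;	/* ~8000 ns per request */

	unsigned long batch = 64;	/* expected epoll_wait batch size */

	/* short deferral: a bit more than one request's service time */
	unsigned long long gro_flush_timeout = (unsigned long long)(svc_ns * 2.5);
	/* suspension ceiling: roughly one full batch of processing */
	unsigned long long irq_suspend_timeout = (unsigned long long)(svc_ns * batch);

	printf("gro_flush_timeout   ~ %llu ns\n", gro_flush_timeout);
	printf("irq_suspend_timeout ~ %llu ns\n", irq_suspend_timeout);
	return 0;
}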
> >>>> The value may not be obvious, but guidance (in the form of > >>>> documentation) can be provided. > >>> > >>> Okay. Could you share a stab at what that would look like? > >> > >> The timeout needs to be large enough that an application can get a > >> meaningful number of incoming requests processed without softirq > >> interference. At the same time, the timeout value determines the > >> worst-case delivery delay that a concurrent application using the same > >> queue(s) might experience. Please also see my response to Samiullah > >> quoted above. The specific circumstances and trade-offs might vary, > >> that's why a simple constant likely won't do. > > > > Thanks. I really do mean this as an exercise of what documentation in > > Documentation/networking/napi.rst will look like. That helps makes the > > case that the interface is reasonably ease to use (even if only > > targeting advanced users). > > > > How does a user measure how much time a process will spend on > > processing a meaningful number of incoming requests, for instance. > > In practice, probably just a hunch? > > As an example, we measure around 1M QPS in our experiments, fully > utilizing 8 cores and knowing that memcached is quite scalable. Thus we > can conclude a single request takes about 8 us processing time on > average. That has led us to a 20 us small timeout (gro_flush_timeout), > enough to make sure that a single request is likely not interfered with, > but otherwise as small as possible. If multiple requests arrive, the > system will quickly switch back to polling mode. > > At the other end, we have picked a very large irq_suspend_timeout of > 20,000 us to demonstrate that it does not negatively impact latency. > This would cover 2,500 requests, which is likely excessive, but was > chosen for demonstration purposes. One can easily measure the > distribution of epoll_wait batch sizes and batch sizes as low as 64 are > already very efficient, even in high-load situations. Overall Ack on both your and Joe's responses. epoll_wait disables the suspend if no events are found and ep_poll would go to sleep. As the paper also hints, the timeout is only there for misbehaving applications that stop calling epoll_wait, correct? If so, then picking a value is not that critical, as long as not too low to do meaningful work. > Also see next paragraph. > > > Playing devil's advocate some more: given that ethtool usecs have to > > be chosen with a similar trade-off between latency and efficiency, > > could a multiplicative factor of this (or gro_flush_timeout, same > > thing) be sufficient and easier to choose? The documentation does > > state that the value chosen must be >= gro_flush_timeout. > > I believe this would take away flexibility without gaining much. You'd > still want some sort of admin-controlled 'enable' flag, so you'd still > need some kind of parameter. > > When using our scheme, the factor between gro_flush_timeout and > irq_suspend_timeout should *roughly* correspond to the maximum batch > size that an application would process in one go (orders of magnitude, > see above). This determines both the target application's worst-case > latency as well as the worst-case latency of concurrent applications, if > any, as mentioned previously. Oh is concurrent applications the argument against a very high timeout? > I believe the optimal factor will vary > between different scenarios. 
> > >>>>> If the only goal is to safely reenable interrupts when the application > >>>>> stops calling epoll_wait, does this have to be user tunable? > >>>>> > >>>>> Can it be either a single good enough constant, or derived from > >>>>> another tunable, like busypoll_read. > >>>> > >>>> I believe you meant busy_read here, is that right? > >>>> > >>>> At any rate: > >>>> > >>>> - I don't think a single constant is appropriate, just as it > >>>> wasn't appropriate for the existing mechanism > >>>> (napi_defer_hard_irqs/gro_flush_timeout), and > >>>> > >>>> - Deriving the value from a pre-existing parameter to preserve the > >>>> ABI, like busy_read, makes using this more confusing for users > >>>> and complicates the API significantly. > >>>> > >>>> I agree we should get the API right from the start; that's why we've > >>>> submit this as an RFC ;) > >>>> > >>>> We are happy to take suggestions from the community, but, IMHO, > >>>> re-using an existing parameter for a different purpose only in > >>>> certain circumstances (if I understand your suggestions) is a much > >>>> worse choice than adding a new tunable that clearly states its > >>>> intended singular purpose. > >>> > >>> Ack. I was thinking whether an epoll flag through your new epoll > >>> ioctl interface to toggle the IRQ suspension (and timer start) > >>> would be preferable. Because more fine grained. > >> > >> A value provided by an application through the epoll ioctl would not be > >> subject to admin oversight, so a misbehaving application could set an > >> arbitrary timeout value. A sysfs value needs to be set by an admin. The > >> ideal timeout value depends both on the particular target application as > >> well as concurrent applications using the same queue(s) - as sketched above. > > > > I meant setting the value systemwide (or per-device), but opting in to > > the feature a binary epoll options. Really an epoll_wait flag, if we > > had flags. > > > > Any admin privileged operations can also be protected at the epoll > > level by requiring CAP_NET_ADMIN too, of course. But fair point that > > this might operate in a multi-process environment, so values should > > not be hardcoded into the binaries. > > > > Just asking questions to explore the option space so as not to settle > > on an API too soon. Given that, as said, we cannot remove it later. > > I agree, but I believe we are converging? Also taking into account Joe's > earlier response, given that the suspend mechanism dovetails so nicely > with gro_flush_timeout and napi_defer_hard_irqs, it just seems natural > to put irq_suspend_timeout at the same level and I haven't seen any > strong reason to put it elsewhere. Yes, this sounds good. > >>> Also, the value is likely dependent more on the expected duration > >>> of userspace processing? If so, it would be the same for all > >>> devices, so does a per-netdev value make sense? > >> > >> It is per-netdev in the current proposal to be at the same granularity > >> as gro_flush_timeout and napi_defer_hard_irqs, because irq suspension > >> operates at the same level/granularity. This allows for more control > >> than a global setting and it can be migrated to per-napi settings along > >> with gro_flush_timeout and napi_defer_hard_irqs when the time comes. > > > > Ack, makes sense. Many of these design choices and their rationale are > > good to explicitly capture in the commit message. > > Agreed. > > Thanks, > Martin
On 2024-08-18 08:55, Willem de Bruijn wrote: >>>>>> The value may not be obvious, but guidance (in the form of >>>>>> documentation) can be provided. >>>>> >>>>> Okay. Could you share a stab at what that would look like? >>>> >>>> The timeout needs to be large enough that an application can get a >>>> meaningful number of incoming requests processed without softirq >>>> interference. At the same time, the timeout value determines the >>>> worst-case delivery delay that a concurrent application using the same >>>> queue(s) might experience. Please also see my response to Samiullah >>>> quoted above. The specific circumstances and trade-offs might vary, >>>> that's why a simple constant likely won't do. >>> >>> Thanks. I really do mean this as an exercise of what documentation in >>> Documentation/networking/napi.rst will look like. That helps makes the >>> case that the interface is reasonably ease to use (even if only >>> targeting advanced users). >>> >>> How does a user measure how much time a process will spend on >>> processing a meaningful number of incoming requests, for instance. >>> In practice, probably just a hunch? >> >> As an example, we measure around 1M QPS in our experiments, fully >> utilizing 8 cores and knowing that memcached is quite scalable. Thus we >> can conclude a single request takes about 8 us processing time on >> average. That has led us to a 20 us small timeout (gro_flush_timeout), >> enough to make sure that a single request is likely not interfered with, >> but otherwise as small as possible. If multiple requests arrive, the >> system will quickly switch back to polling mode. >> >> At the other end, we have picked a very large irq_suspend_timeout of >> 20,000 us to demonstrate that it does not negatively impact latency. >> This would cover 2,500 requests, which is likely excessive, but was >> chosen for demonstration purposes. One can easily measure the >> distribution of epoll_wait batch sizes and batch sizes as low as 64 are >> already very efficient, even in high-load situations. > > Overall Ack on both your and Joe's responses. > > epoll_wait disables the suspend if no events are found and ep_poll > would go to sleep. As the paper also hints, the timeout is only there > for misbehaving applications that stop calling epoll_wait, correct? > If so, then picking a value is not that critical, as long as not too > low to do meaningful work. Correct. >> Also see next paragraph. >> >>> Playing devil's advocate some more: given that ethtool usecs have to >>> be chosen with a similar trade-off between latency and efficiency, >>> could a multiplicative factor of this (or gro_flush_timeout, same >>> thing) be sufficient and easier to choose? The documentation does >>> state that the value chosen must be >= gro_flush_timeout. >> >> I believe this would take away flexibility without gaining much. You'd >> still want some sort of admin-controlled 'enable' flag, so you'd still >> need some kind of parameter. >> >> When using our scheme, the factor between gro_flush_timeout and >> irq_suspend_timeout should *roughly* correspond to the maximum batch >> size that an application would process in one go (orders of magnitude, >> see above). This determines both the target application's worst-case >> latency as well as the worst-case latency of concurrent applications, if >> any, as mentioned previously. > > Oh is concurrent applications the argument against a very high > timeout? Only in the error case. 
If irq_suspend_timeout is large enough as you point out above, then as long as the target application behaves well, its batching settings are the determining factor. Thanks, Martin
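To pin down the behaviour being confirmed in this exchange, here is a toy model of the decision taken on each epoll_wait return. This is not kernel code and not the series' implementation, just the state transition described in this thread.

#include <stdio.h>

enum irq_mode { SHORT_DEFERRAL, SUSPENDED };

static enum irq_mode mode = SHORT_DEFERRAL;

static void epoll_wait_model(int nevents)
{
	if (nevents > 0) {
		/* Events handed to userspace: (re)start suspension, so device
		 * IRQs stay off for up to irq_suspend_timeout - the safety net
		 * for an application that stops calling epoll_wait. */
		mode = SUSPENDED;
	} else {
		/* Nothing found and the thread is about to block: fall back to
		 * the normal gro_flush_timeout/napi_defer_hard_irqs cycle
		 * before sleeping. */
		mode = SHORT_DEFERRAL;
	}
	printf("epoll_wait returned %d events -> %s\n", nevents,
	       mode == SUSPENDED ? "IRQs suspended" : "short deferral");
}

int main(void)
{
	int trace[] = { 8, 64, 32, 0 };	/* busy, busy, busy, then idle */

	for (unsigned long i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
		epoll_wait_model(trace[i]);
	return 0;
}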
On Tue, 13 Aug 2024 21:14:40 -0400 Martin Karsten wrote: > > What about NIC interrupt coalescing. defer_hard_irqs_count was supposed > > to be used with NICs which either don't have IRQ coalescing or have a > > broken implementation. The timeout of 200usec should be perfectly within > > range of what NICs can support. > > > > If the NIC IRQ coalescing works, instead of adding a new timeout value > > we could add a new deferral control (replacing defer_hard_irqs_count) > > which would always kick in after seeing prefer_busy_poll() but also > > not kick in if the busy poll harvested 0 packets. > Maybe I am missing something, but I believe this would have the same > problem that we describe for gro-timeout + defer-irq. When busy poll > does not harvest packets and the application thread is idle and goes to > sleep, it would then take up to 200 us to get the next interrupt. This > considerably increases tail latencies under low load. > > In order get low latencies under low load, the NIC timeout would have to > be something like 20 us, but under high load the application thread will > be busy for longer than 20 us and the interrupt (and softirq) will come > too early and cause interference. An FSM-like diagram would go a long way in clarifying things :) > It is tempting to think of the second timeout as 0 and in fact re-enable > interrupts right away. We have tried it, but it leads to a lot of > interrupts and corresponding inefficiencies, since a system below > capacity frequently switches between busy and idle. Using a small > timeout (20 us) for modest deferral and batching when idle is a lot more > efficient. I see. I think we are on the same page. What I was suggesting is to use the HW timer instead of the short timer. But I suspect the NIC you're using isn't really good at clearing IRQs before unmasking. Meaning that when you try to reactivate HW control there's already an IRQ pending and it fires pointlessly. That matches my experience with mlx5. If the NIC driver was to clear the IRQ state before running the NAPI loop, we would have no pending IRQ by the time we unmask and activate HW IRQs. Sorry for the delay.
On Sun, 18 Aug 2024 10:51:04 -0400 Martin Karsten wrote: > >> I believe this would take away flexibility without gaining much. You'd > >> still want some sort of admin-controlled 'enable' flag, so you'd still > >> need some kind of parameter. > >> > >> When using our scheme, the factor between gro_flush_timeout and > >> irq_suspend_timeout should *roughly* correspond to the maximum batch > >> size that an application would process in one go (orders of magnitude, > >> see above). This determines both the target application's worst-case > >> latency as well as the worst-case latency of concurrent applications, if > >> any, as mentioned previously. > > > > Oh is concurrent applications the argument against a very high > > timeout? > > Only in the error case. If irq_suspend_timeout is large enough as you > point out above, then as long as the target application behaves well, > its batching settings are the determining factor. Since the discussion is still sort of going on let me ask something potentially stupid (I haven't read the paper, yet). Are the cores assumed to be fully isolated (ergo the application can only yield to the idle thread)? Do we not have to worry about the scheduler deciding to schedule the process out involuntarily?
On 2024-08-19 22:07, Jakub Kicinski wrote: > On Tue, 13 Aug 2024 21:14:40 -0400 Martin Karsten wrote: >>> What about NIC interrupt coalescing. defer_hard_irqs_count was supposed >>> to be used with NICs which either don't have IRQ coalescing or have a >>> broken implementation. The timeout of 200usec should be perfectly within >>> range of what NICs can support. >>> >>> If the NIC IRQ coalescing works, instead of adding a new timeout value >>> we could add a new deferral control (replacing defer_hard_irqs_count) >>> which would always kick in after seeing prefer_busy_poll() but also >>> not kick in if the busy poll harvested 0 packets. >> Maybe I am missing something, but I believe this would have the same >> problem that we describe for gro-timeout + defer-irq. When busy poll >> does not harvest packets and the application thread is idle and goes to >> sleep, it would then take up to 200 us to get the next interrupt. This >> considerably increases tail latencies under low load. >> >> In order get low latencies under low load, the NIC timeout would have to >> be something like 20 us, but under high load the application thread will >> be busy for longer than 20 us and the interrupt (and softirq) will come >> too early and cause interference. > > An FSM-like diagram would go a long way in clarifying things :) I agree the suspend mechanism is not trivial and the implementation is subtle. It has frequently made our heads hurt while developing this. We will take a long hard look at our cover letter and produce other documentation to hopefully provide clear explanations. >> It is tempting to think of the second timeout as 0 and in fact re-enable >> interrupts right away. We have tried it, but it leads to a lot of >> interrupts and corresponding inefficiencies, since a system below >> capacity frequently switches between busy and idle. Using a small >> timeout (20 us) for modest deferral and batching when idle is a lot more >> efficient. > > I see. I think we are on the same page. What I was suggesting is to use > the HW timer instead of the short timer. But I suspect the NIC you're > using isn't really good at clearing IRQs before unmasking. Meaning that > when you try to reactivate HW control there's already an IRQ pending > and it fires pointlessly. That matches my experience with mlx5. > If the NIC driver was to clear the IRQ state before running the NAPI > loop, we would have no pending IRQ by the time we unmask and activate > HW IRQs. I believe there are additional issues. The problem is that the long timeout must engage if and only if prefer-busy is active. When using NIC coalescing for the short timeout (without gro/defer), an interrupt after an idle period will trigger softirq, which will run napi polling. At this point, prefer-busy is not active, so NIC interrupts would be re-enabled. Then it is not possible for the longer timeout to interject to switch control back to polling. In other words, only by using the software timer for the short timeout, it is possible to extend the timeout without having to reprogram the NIC timer or reach down directly and disable interrupts. Using gro_flush_timeout for the long timeout also has problems, for the same underlying reason. In the current napi implementation, gro_flush_timeout is not tied to prefer-busy. 
We'd either have to change that and in the process modify the existing deferral mechanism, or introduce a state variable to determine whether gro_flush_timeout is used as long timeout for irq suspend or whether it is used for its default purpose. In an earlier version, we did try something similar to the latter and made it work, but it ends up being a lot more convoluted than our current proposal. Thanks, Martin
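A toy illustration of the point about state: which deferral period gets armed after a poll depends on whether suspension is currently active, so neither gro_flush_timeout alone nor a NIC coalescing timer can play both roles without an extra flag. Field and function names below are invented for the illustration; none of this is kernel code, and the values are the example 20 us / 20 ms pair from earlier in the thread.

#include <stdbool.h>
#include <stdio.h>

struct napi_model {
	unsigned long long gro_flush_timeout;	/* short deferral, ns */
	unsigned long long irq_suspend_timeout;	/* suspension ceiling, ns */
	bool suspended;				/* set while epoll_wait keeps finding events */
};

/* Pick the timer to arm after a napi poll completes with no work left. */
static unsigned long long next_timer_ns(const struct napi_model *n)
{
	return n->suspended ? n->irq_suspend_timeout : n->gro_flush_timeout;
}

int main(void)
{
	struct napi_model n = { 20000, 20000000, false };

	printf("idle system: arm %llu ns\n", next_timer_ns(&n));	/* 20 us */
	n.suspended = true;
	printf("busy system: arm %llu ns\n", next_timer_ns(&n));	/* 20 ms */
	return 0;
}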
On 2024-08-19 22:36, Jakub Kicinski wrote: > On Sun, 18 Aug 2024 10:51:04 -0400 Martin Karsten wrote: >>>> I believe this would take away flexibility without gaining much. You'd >>>> still want some sort of admin-controlled 'enable' flag, so you'd still >>>> need some kind of parameter. >>>> >>>> When using our scheme, the factor between gro_flush_timeout and >>>> irq_suspend_timeout should *roughly* correspond to the maximum batch >>>> size that an application would process in one go (orders of magnitude, >>>> see above). This determines both the target application's worst-case >>>> latency as well as the worst-case latency of concurrent applications, if >>>> any, as mentioned previously. >>> >>> Oh is concurrent applications the argument against a very high >>> timeout? >> >> Only in the error case. If suspend_irq_timeout is large enough as you >> point out above, then as long as the target application behaves well, >> its batching settings are the determining factor. > > Since the discussion is still sort of going on let me ask something > potentially stupid (I haven't read the paper, yet). Are the cores > assumed to be fully isolated (ergo the application can only yield > to the idle thread)? Do we not have to worry about the scheduler > deciding to schedule the process out involuntarily? That shouldn't be a problem. If the next thread(s) can make progress, nothing is lost. If the next thread(s) cannot make progress, for example waiting for network I/O, they will block and the target application thread will run again. If another thread is busy-looping on network I/O, I would argue that having multiple busy-looping threads competing for the same core is probably not a good idea anyway. Thanks, Martin