Message ID: cover.1604055792.git.pabeni@redhat.com
Series: net: introduce rps_default_mask
On Fri, 30 Oct 2020 12:16:00 +0100 Paolo Abeni wrote:
> Real-time setups try hard to ensure proper isolation between
> time-critical applications and e.g. network processing performed by
> the network stack in softirq, and RPS is used to move the softirq
> activity away from the isolated cores.
>
> If the network configuration is dynamic, with netns and devices
> routinely created at run-time, enforcing the correct RPS setting on
> each newly created device while avoiding transient bad configurations
> becomes complex.
>
> This series tries to address the above by introducing a new sysctl
> knob: rps_default_mask. The new sysctl entry allows configuring a
> system-wide RPS mask, enforced from receive queue creation time
> without any further per-device configuration required.
>
> Additionally, a simple self-test is introduced to check the
> rps_default_mask behavior.

RPS is disabled by default, the processing is going to happen wherever
the IRQ is mapped, and one would hope that the IRQ is not mapped to the
core where the critical processing runs.

Would you mind elaborating further on the use case?
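[Editor's note: for readers unfamiliar with RPS, the existing knob the series wants to default globally is the per-queue rps_cpus file in sysfs. The following is a minimal sketch of how it is set today, one device and one receive queue at a time; the device name and mask value are examples only.]

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* "e" = 0b1110: steer RPS processing to CPUs 1-3, keep CPU 0 free.
	 * "eth0"/"rx-0" are example names; real setups repeat this for
	 * every rx queue of every device (and inside every netns). */
	const char *mask = "e";
	const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, mask, strlen(mask)) < 0)
		perror("write");
	close(fd);
	return 0;
}
```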
On Mon, 2020-11-02 at 14:54 -0800, Jakub Kicinski wrote:
> On Fri, 30 Oct 2020 12:16:00 +0100 Paolo Abeni wrote:
> > [...]
> > This series tries to address the above by introducing a new sysctl
> > knob: rps_default_mask. The new sysctl entry allows configuring a
> > system-wide RPS mask, enforced from receive queue creation time
> > without any further per-device configuration required.

The whole thing can be replaced with a user daemon script that monitors
all newly created devices and assigns to them whatever RPS mask we want
(call it the default).

So why do we need this special logic in the kernel?

I am not sure about this, but if the RPS queue sysfs entries are
available before the netdev is up, then you can also use udevd to
assign the RPS masks before such devices are even brought up, so you
would avoid the race conditions that you described, which are not
really clear to me, to be honest.

> > Additionally, a simple self-test is introduced to check the
> > rps_default_mask behavior.
>
> RPS is disabled by default, the processing is going to happen wherever
> the IRQ is mapped, and one would hope that the IRQ is not mapped to the
> core where the critical processing runs.
>
> Would you mind elaborating further on the use case?
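[Editor's note: a rough, best-effort sketch of the user-space daemon approach suggested above - not code from the thread. It listens for RTM_NEWLINK notifications over rtnetlink and writes an example mask to every rx queue of each new device. The mask value and error handling are simplified; note that the listener only sees links in its own network namespace, and a device can start receiving before the daemon reacts, which is the race discussed in the next message.]

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if.h>

#define DEFAULT_RPS_MASK "e"	/* example: CPUs 1-3, CPU 0 kept isolated */

/* Best effort: write the mask to every rx queue of the named device. */
static void set_rps_mask(const char *ifname)
{
	char path[256];
	int q, fd;

	for (q = 0; ; q++) {
		snprintf(path, sizeof(path),
			 "/sys/class/net/%s/queues/rx-%d/rps_cpus", ifname, q);
		fd = open(path, O_WRONLY);
		if (fd < 0)
			break;	/* no more rx queues, or the device went away */
		write(fd, DEFAULT_RPS_MASK, strlen(DEFAULT_RPS_MASK));
		close(fd);
	}
}

int main(void)
{
	struct sockaddr_nl sa = {
		.nl_family = AF_NETLINK,
		.nl_groups = RTMGRP_LINK,	/* link add/change notifications */
	};
	char buf[8192];
	int fd;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	if (fd < 0 || bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		perror("netlink");
		return 1;
	}

	for (;;) {
		int len = recv(fd, buf, sizeof(buf), 0);
		struct nlmsghdr *nh;

		if (len <= 0)
			continue;

		for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
		     nh = NLMSG_NEXT(nh, len)) {
			struct ifinfomsg *ifi;
			char name[IF_NAMESIZE];

			if (nh->nlmsg_type != RTM_NEWLINK)
				continue;
			ifi = NLMSG_DATA(nh);
			/* resolves in this daemon's netns only */
			if (if_indextoname(ifi->ifi_index, name))
				set_rps_mask(name);
		}
	}
	return 0;
}
```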
On Mon, 2020-11-02 at 14:54 -0800, Jakub Kicinski wrote:
> [...]
> RPS is disabled by default, the processing is going to happen wherever
> the IRQ is mapped, and one would hope that the IRQ is not mapped to the
> core where the critical processing runs.
>
> Would you mind elaborating further on the use case?

On Mon, 2020-11-02 at 15:27 -0800, Saeed Mahameed wrote:
> The whole thing can be replaced with a user daemon script that
> monitors all newly created devices and assigns to them whatever RPS
> mask we want (call it the default).
>
> So why do we need this special logic in the kernel?
> [...]

Thank you for the feedback.

Please allow me to answer you both here, as your questions are related.

The relevant use case is a host running containers (with the related
orchestration tools) in an RT environment. Virtual devices (veths, OVS
ports, etc.) are created by the orchestration tools at run-time.
Critical processes are allowed to send packets/generate outgoing
network traffic - but any interrupt is moved away from the related
cores, so that the usual incoming network traffic processing does not
happen there.

Still, an xmit operation on a virtual device may be forwarded via OVS
or veth, with the relevant forwarding operation happening in a softirq
on the same CPU that originated the packet.

RPS is configured (even) on such virtual devices to move the
forwarding away from the relevant CPUs.

As Saeed noted, such configuration could possibly be performed via some
user-space daemon monitoring network device and network namespace
creation. That will anyway be prone to a race: the orchestration tool
may create and enable the netns and virtual devices before the daemon
has properly set the RPS mask.

In the latter scenario some packet forwarding could still slip onto the
relevant CPU, causing measurable latency. In all non-RT scenarios the
above will likely be irrelevant, but in the RT context that is not
acceptable - e.g. it causes, in real environments, latency above the
defined limits, while the proposed patches avoid the issue.

Do you see any other simple way to avoid the above race?

Please let me know if the above answers your doubts,

Paolo
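[Editor's note: one simple way to observe whether forwarding work is slipping onto an isolated core, added here purely for illustration, is to watch the per-CPU NET_RX softirq counters; on a properly isolated core the count should stay flat while traffic flows elsewhere. A minimal sketch:]

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[4096];
	FILE *f = fopen("/proc/softirqs", "r");

	if (!f) {
		perror("/proc/softirqs");
		return 1;
	}
	/* the first line carries the CPU column labels */
	if (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "NET_RX:"))	/* per-CPU receive softirq counts */
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}
```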
On Tue, 03 Nov 2020 16:22:07 +0100 Paolo Abeni wrote:
> [...]
> In the latter scenario some packet forwarding could still slip onto the
> relevant CPU, causing measurable latency. In all non-RT scenarios the
> above will likely be irrelevant, but in the RT context that is not
> acceptable - e.g. it causes, in real environments, latency above the
> defined limits, while the proposed patches avoid the issue.
>
> Do you see any other simple way to avoid the above race?
>
> Please let me know if the above answers your doubts,

Thanks, that makes it clearer now.

Depending on how RT-aware your container management is, it may or may not
be the right place to configure this, as it creates the veth interface.
Presumably it's the container management which does the placement of
the tasks to cores, why is it not setting other attributes, like RPS?

Also I wonder if it would make sense to turn this knob into something
more generic. When we arrive at the threaded NAPIs - could it make
sense for the threads to inherit your mask as the CPUs they are allowed
to run on?
On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> [...]
> Depending on how RT-aware your container management is, it may or may not
> be the right place to configure this, as it creates the veth interface.
> Presumably it's the container management which does the placement of
> the tasks to cores, why is it not setting other attributes, like RPS?

The container orchestration is quite complex, and I'm unsure whether
isolation and networking configuration are performed (or can be
performed) by the same process (without a heavy refactor).

On the flip side, the global RPS mask knob looked quite
straightforward to me.

Possibly I can reduce the amount of new code introduced by this
patchset by removing some code duplication between
rps_default_mask_sysctl() and flow_limit_cpu_sysctl(). Would that make
this change more acceptable? Or should I drop this altogether?

> Also I wonder if it would make sense to turn this knob into something
> more generic. When we arrive at the threaded NAPIs - could it make
> sense for the threads to inherit your mask as the CPUs they are allowed
> to run on?

I personally *think* this would be fine - and good. But isn't it a bit
premature to discuss the integration of 2 missing pieces? :)

Thanks,

Paolo
On Wed, Nov 04, 2020 at 06:36:08PM +0100, Paolo Abeni wrote:
> On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> > [...]
> > Depending on how RT-aware your container management is, it may or may not
> > be the right place to configure this, as it creates the veth interface.
> > Presumably it's the container management which does the placement of
> > the tasks to cores, why is it not setting other attributes, like RPS?
>
> The container orchestration is quite complex, and I'm unsure whether
> isolation and networking configuration are performed (or can be
> performed) by the same process (without a heavy refactor).

Also for the host side (no containers) the same issue will have to be
handled, for PCI hotplug for example. So this fix will have to be
performed in every tool that decides to create a network device (while
a kernel solution is global).

> On the flip side, the global RPS mask knob looked quite
> straightforward to me.
>
> Possibly I can reduce the amount of new code introduced by this
> patchset by removing some code duplication between
> rps_default_mask_sysctl() and flow_limit_cpu_sysctl(). Would that make
> this change more acceptable? Or should I drop this altogether?
> [...]

About the potential race:

0) network device creation starts, inherits the old default_rps_mask,
   network device init sleeps
1) set default_rps_mask (new)
2) change all devices across all network namespaces (walk /sys)
3) network device init wakes up, new device shows up in /sys using the
   old default_rps_mask

Why can't this happen?
On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
> On Tue, 2020-11-03 at 08:52 -0800, Jakub Kicinski wrote:
> > [...]
> > Depending on how RT-aware your container management is, it may or may not
> > be the right place to configure this, as it creates the veth interface.
> > Presumably it's the container management which does the placement of
> > the tasks to cores, why is it not setting other attributes, like RPS?
>
> The container orchestration is quite complex, and I'm unsure whether
> isolation and networking configuration are performed (or can be
> performed) by the same process (without a heavy refactor).
>
> On the flip side, the global RPS mask knob looked quite
> straightforward to me.

I understand, but I can't shake the feeling this is a hack.

Whatever sets the CPU isolation should take care of the RPS settings.

> Possibly I can reduce the amount of new code introduced by this
> patchset by removing some code duplication between
> rps_default_mask_sysctl() and flow_limit_cpu_sysctl(). Would that make
> this change more acceptable? Or should I drop this altogether?

I'm leaning towards dropping it altogether, unless you can get some
support/review tags from other netdev developers. So far it appears we
only got a down vote from Saeed.

> > Also I wonder if it would make sense to turn this knob into something
> > more generic. When we arrive at the threaded NAPIs - could it make
> > sense for the threads to inherit your mask as the CPUs they are allowed
> > to run on?
>
> I personally *think* this would be fine - and good. But isn't it a bit
> premature to discuss the integration of 2 missing pieces? :)
Hi all,

On Wed, 2020-11-04 at 12:42 -0700, Jakub Kicinski wrote:
> On Wed, 04 Nov 2020 18:36:08 +0100 Paolo Abeni wrote:
> > [...]
> > On the flip side, the global RPS mask knob looked quite
> > straightforward to me.
>
> I understand, but I can't shake the feeling this is a hack.
>
> Whatever sets the CPU isolation should take care of the RPS settings.

Let me try for a moment to revive this old thread.

The series proposed a new sysctl knob to implement a global/default RPS
mask applying to all the network devices, as a way to simplify some RT
setups. It was rejected because the required task is doable in
user-space.

Currently the orchestration infrastructure does that, setting the per
device, per queue RPS mask and the CPU isolation.

The above leads to a side problem: when there are a lot of
netns/devices with several queues, even a reasonably optimized
user-space solution takes a relevant amount of time to traverse the
relevant sysfs dirs and do I/O on them. Overall the additional time
required is very measurable, easily ranging in seconds.

The default_rps_mask would basically kill that overhead.

Is the above a suitable use case?

Thanks,

Paolo
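[Editor's note: for contrast with the per-queue traversal described above, this is what the proposed knob would reduce the userspace work to - a single write, instead of one write per rx queue per device per netns. The /proc/sys/net/core/rps_default_mask path is an assumption here, following the usual sysctl layout for a net.core.* knob; the mask value is again only an example.]

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char *mask = "e";	/* example mask: CPUs 1-3 */
	/* assumed path for the proposed net.core.rps_default_mask sysctl */
	int fd = open("/proc/sys/net/core/rps_default_mask", O_WRONLY);

	if (fd < 0) {
		perror("rps_default_mask");
		return 1;
	}
	if (write(fd, mask, strlen(mask)) < 0)
		perror("write");
	close(fd);
	return 0;
}
```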
On Mon, 30 Jan 2023 10:25:34 +0100 Paolo Abeni wrote:
> Let me try for a moment to revive this old thread.
>
> The series proposed a new sysctl knob to implement a global/default RPS
> mask applying to all the network devices, as a way to simplify some RT
> setups. It was rejected because the required task is doable in
> user-space.
>
> Currently the orchestration infrastructure does that, setting the per
> device, per queue RPS mask and the CPU isolation.
>
> The above leads to a side problem: when there are a lot of
> netns/devices with several queues, even a reasonably optimized
> user-space solution takes a relevant amount of time to traverse the
> relevant sysfs dirs and do I/O on them. Overall the additional time
> required is very measurable, easily ranging in seconds.
>
> The default_rps_mask would basically kill that overhead.
>
> Is the above a suitable use case?

Alright, thanks for trying the user space fix.