| Message ID | 20220525221055.1152307-5-frederic@kernel.org |
| --- | --- |
| State | New, archived |
| Series | rcu/cpuset: Control RCU_NOCB offloading through cpusets |
On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> directory which provides support for a set of CPUs to either enable ("1")
> or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
>
> The file is only writeable on "root" type partitions to exclude any
> overlap. The deepest root type partition has the highest priority.
> This means that given the following setting:
>
>                Top cpuset (CPUs: 0-7)
>           cpuset.isolation.rcu_nocb = 0
>                         |
>                         |
>             Subdirectory A (CPUs: 5-7)
>           cpuset.cpus.partition = root
>           cpuset.isolation.rcu_nocb = 0
>                         |
>                         |
>              Subdirectory B (CPUs: 7)
>           cpuset.cpus.partition = root
>           cpuset.isolation.rcu_nocb = 1
>
> the result is that only CPU 7 is in rcu_nocb mode.
>
> Note that "rcu_nocbs" kernel parameter must be passed on boot, even
> without a cpulist, so that nocb support is enabled.

Does it even make sense to make this hierarchical? What's wrong with a
cpumask under sys/ or proc/?

Thanks.
On Thu, May 26, 2022 at 08:21:13AM -1000, Tejun Heo wrote:
> On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> > Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> > directory which provides support for a set of CPUs to either enable ("1")
> > or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> > overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
> > [...]
>
> Does it even make sense to make this hierarchical? What's wrong with a
> cpumask under sys/ or proc/?

I'm usually told that cpusets is the current place where CPU attributes are
supposed to go. I personally don't mind much /sys either even though cpusets
looks like a more flexible way to partition CPUs with properties and tasks
placement altogether...
On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
> > Does it even make sense to make this hierarchical? What's wrong with a
> > cpumask under sys/ or proc/?
>
> I'm usually told that cpusets is the current place where CPU attributes are
> supposed to go. I personally don't mind much /sys either even though cpusets
> looks like a more flexible way to partition CPUs with properties and tasks
> placement altogether...

Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
time seeing anything hierarchical with this one. Somebody just has to know
which cpus are up for rcu processing and which aren't. Waiman, what do you
think?

Thanks.
On 5/26/22 19:02, Tejun Heo wrote:
> On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
>>> Does it even make sense to make this hierarchical? What's wrong with a
>>> cpumask under sys/ or proc/?
>> I'm usually told that cpusets is the current place where CPU attributes are
>> supposed to go. I personally don't mind much /sys either even though cpusets
>> looks like a more flexible way to partition CPUs with properties and tasks
>> placement altogether...
> Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
> time seeing anything hierarchical with this one. Somebody just has to know
> which cpus are up for rcu processing and which aren't. Waiman, what do you
> think?

I am thinking along the line that it will not be hierarchical. However,
cpuset can be useful if we want to have multiple isolated partitions
underneath the top cpuset with different isolation attributes, but no more
sub-isolated partition with sub-attributes underneath them. IOW, we can only
set them at the first level under top_cpuset. Will that be useful?

Cheers,
Longman
On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> I am thinking along the line that it will not be hierarchical. However,
> cpuset can be useful if we want to have multiple isolated partitions
> underneath the top cpuset with different isolation attributes, but no more
> sub-isolated partition with sub-attributes underneath them. IOW, we can only
> set them at the first level under top_cpuset. Will that be useful?

At that point, I'd just prefer to have it under /proc or /sys.

Thanks.
Hi,

On 26/05/22 14:37, Tejun Heo wrote:
> On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > I am thinking along the line that it will not be hierarchical. However,
> > cpuset can be useful if we want to have multiple isolated partitions
> > underneath the top cpuset with different isolation attributes, but no more
> > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > set them at the first level under top_cpuset. Will that be useful?
>
> At that point, I'd just prefer to have it under /proc or /sys.

FWIW, I was under the impression that this would nicely fit along the
side of other features towards implementing dynamic isolation of CPUs (say
https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
for example). Wouldn't it be awkward to have to poke different places to
achieve isolation at runtime?

Also, I wonder if a proc/sys interface might be problematic for certain
middleware that is substantially based on using cgroups. I'll try to ask
around. :)

Best,
Juri
Hello,

On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> FWIW, I was under the impression that this would nicely fit along the
> side of other features towards implementing dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> for example). Wouldn't it be awkward to have to poke different places to
> achieve isolation at runtime?

So, if it were just being part of the isolated domain thing, it would make
sense, but as a separate flag which isn't hierarchical, it's weird to put it
there.

> Also, I wonder if a proc/sys interface might be problematic for certain
> middleware that is substantially based on using cgroups. I'll try to ask
> around. :)

There is a downside to making a feature a part of cpuset in that it makes
cgroup usage mandatory. This is fine for something which benefits from
hierarchical organization but it is weird to require building cgroup
hierarchy for straight-forward system-wide features.

Thanks.
Hi,

On Thu, May 26, 2022 at 10:45:00PM -1000 Tejun Heo wrote:
> Hello,
>
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other features towards implementing dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > for example). Wouldn't it be awkward to have to poke different places to
> > achieve isolation at runtime?
>
> So, if it were just being part of the isolated domain thing, it would make
> sense, but as a separate flag which isn't hierarchical, it's weird to put it
> there.

The way I see it is more that the "isolated domain thing" is one part of this
whole dynamic isolation thing and is just one flag among many (most still on
the drawing board, but planned). It may be that Waiman's "isolated" should be
renamed "no_load_balance" or something. Part of this is making cpu isolation
more granular.

> > Also, I wonder if a proc/sys interface might be problematic for certain
> > middleware that is substantially based on using cgroups. I'll try to ask
> > around. :)
>
> There is a downside to making a feature a part of cpuset in that it makes
> cgroup usage mandatory. This is fine for something which benefits from
> hierarchical organization but it is weird to require building cgroup
> hierarchy for straight-forward system-wide features.

That ship may have sailed when SD_LOAD_BALANCE was removed, which is part of
what Waiman's feature addresses. That is, now in order to get control over
the system-wide feature of which CPUs get scheduler load balanced you need
to use cpusets.

My 3 cents anyway (inflation ;)

Cheers,
Phil
On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> Hi,
>
> On 26/05/22 14:37, Tejun Heo wrote:
> > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > I am thinking along the line that it will not be hierarchical. However,
> > > cpuset can be useful if we want to have multiple isolated partitions
> > > underneath the top cpuset with different isolation attributes, but no more
> > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > set them at the first level under top_cpuset. Will that be useful?
> >
> > At that point, I'd just prefer to have it under /proc or /sys.
>
> FWIW, I was under the impression that this would nicely fit along the
> side of other features towards implementing dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> for example). Wouldn't it be awkward to have to poke different places to
> achieve isolation at runtime?

This, that's what I was thinking.

My main objection to the whole thing is that it's an RCU_NOCB specific
interface. *That* I think is daft.

I was thinking a partition would be able to designate a house-keeping
sub-partition/mask, but who cares about all the various different
housekeeping parties.
On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other features towards implementing dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > for example). Wouldn't it be awkward to have to poke different places to
> > achieve isolation at runtime?
>
> This, that's what I was thinking.
>
> My main objection to the whole thing is that it's an RCU_NOCB specific
> interface. *That* I think is daft.
>
> I was thinking a partition would be able to designate a house-keeping
> sub-partition/mask, but who cares about all the various different
> housekeeping parties.

It's time for the isolation users to step up here! I very rarely hear from them
and I just can't figure out by myself all the variants of uses for each of the
isolation features. May be some people are only interested in nocb for some
specific uses, or may be it never makes sense without nohz full and all the rest
of the isolation features. So for now I take the very cautious path to split the
interface.

Thanks.
On Mon, May 30, 2022 at 02:40:49AM +0200, Frederic Weisbecker wrote:
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.

This is ABI, you can't walk back on it. I would suggest starting with an
'all feature' isolation. Only if there's real demand for something more
fine-grained add that on top. Simple first etc.
On Mon, May 30, 2022 at 10:11:41AM +0200, Peter Zijlstra wrote:
> This is ABI, you can't walk back on it. I would suggest starting with an
> 'all feature' isolation. Only if there's real demand for something more
> fine-grained add that on top. Simple first etc.

That's actually my worry. If we start with an all in one ABI, how do we later
mix that up with more finegrained features? Like what will be the behaviour of:

cpuset.isolation.rcu_nocb = 0
cpuset.isolation.all = 1
On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
> > This is ABI, you can't walk back on it. I would suggest starting with an
> > 'all feature' isolation. Only if there's real demand for something more
> > fine-grained add that on top. Simple first etc.
>
> That's actually my worry. If we start with an all in one ABI, how do we later
> mix that up with more finegrained features? Like what will be the behaviour of:
>
> cpuset.isolation.rcu_nocb = 0
> cpuset.isolation.all = 1

Well clearly that doesn't make sense. I was more thinking along the
lines of cgroup.subtree_control, where instead all features are enabled
by default.

But only if there's a real usecase, otherwise there's no point in
providing such knobs.
On 30/05/22 15:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
> > That's actually my worry. If we start with an all in one ABI, how do we later
> > mix that up with more finegrained features? Like what will be the behaviour of:
> >
> > cpuset.isolation.rcu_nocb = 0
> > cpuset.isolation.all = 1
>
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

All features on/off knob + house-keeping sub-partition/mask seem to be what
isolation users I could reach so far (OCP mostly) would indeed like to have
in the future.
On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.

OK, my 2 cents. I personally deal with virtualisation setups that involve RT
and CPU isolation on both host and guests.

The main use-case ATM is running DPDK-like workloads. We want to achieve
latencies in the order of tens of microseconds, so it's essential to avoid
entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
Everything is deferred onto housekeeping CPUs or disabled.

Then we have setups that need to deal with HW on the host, exposed to the guest
through emulation or VirtIO. The same rules apply really, except for some IRQ
affinity tweaks and sched priority magic.

I find it hard to see how running RCU callback locally could be useful to any
latency sensitive workload.

Frederic, out of curiosity, do you have a use-case in mind that might benefit
from nohz_full but not rcu_nocb? Maybe HPC?

Regards,
Nicolas
On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> OK, my 2 cents. I personally deal with virtualisation setups that involve RT
> and CPU isolation on both host and guests.
>
> The main use-case ATM is running DPDK-like workloads. We want to achieve
> latencies in the order of tens of microseconds, so it's essential to avoid
> entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
> callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
> Everything is deferred onto housekeeping CPUs or disabled.
>
> Then we have setups that need to deal with HW on the host, exposed to the guest
> through emulation or VirtIO. The same rules apply really, except for some IRQ
> affinity tweaks and sched priority magic.
>
> I find it hard to see how running RCU callback locally could be useful to any
> latency sensitive workload.
>
> Frederic, out of curiosity, do you have a use-case in mind that might benefit
> from nohz_full but not rcu_nocb? Maybe HPC?

Would users looking for millisecond-scale latencies want rcu_nocbs but
not nohz_full, that is, the other way around?

							Thanx, Paul
On Mon, May 30, 2022 at 03:16:27PM +0200, Peter Zijlstra wrote:
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

That makes sense. So there would be a simple cpuset.isolation that can
be either 1 or 0 where 1 has all possible isolation stuff on. Then
if the need arises we can provide more tuning through a new specific
cgroup controller, right?

If so that sounds good to me.
> On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> > [...]
> > I find it hard to see how running RCU callback locally could be useful to any
> > latency sensitive workload.
> >
> > Frederic, out of curiosity, do you have a use-case in mind that might benefit
> > from nohz_full but not rcu_nocb? Maybe HPC?

On Mon, May 30, 2022 at 8:42 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> Would users looking for millisecond-scale latencies want rcu_nocbs but
> not nohz_full, that is, the other way around?

On Intel processors running 5.15 with the timersd patches from Siewior
backported and Weisbecker's bug-fix, choosing CONFIG_HZ_PERIODIC prevents
cores from entering deeper C-states when hrtimers are pending. With
CONFIG_NO_HZ_COMMON, the cores do not service non-blocking timer callbacks
until another thread wakes them. Since low latency is critical, this is a
use case where NO_HZ_FULL will not work but rcu_nocbs is needed.

--
Alison Chaiken, Aurora Innovation
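A minimal illustration of the boot-time counterpart of this use case, assuming CPUs 2-5 are the latency-sensitive set: the existing "rcu_nocbs=" parameter is used on its own, with "nohz_full=" deliberately left out.

```
# Illustrative kernel command line fragment (CPU list is an assumption):
# offload RCU callbacks from CPUs 2-5 without enabling full dynticks on them.
rcu_nocbs=2-5
```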
On Mon, May 30, 2022 at 11:35:56PM +0200, Frederic Weisbecker wrote:
> That makes sense. So there would be a simple cpuset.isolation that can
> be either 1 or 0 where 1 has all possible isolation stuff on. Then
> if the need arises we can provide more tuning through a new specific
> cgroup controller, right?

Given that there isn't much that is hierarchical about them, I'm pretty
skeptical about introducing a new controller or fancy hierarchical interface
for it. If isolation is intertwined with cpuset partitioning and a simple knob
for it fits well with the rest of configuration, yeah, but let's please try to
avoid maximizing the interface.

We want the interface to encode users' intentions (e.g., here, I want these
cpus isolated) not the implementation details to make that happen. Of course,
there are gradients but it becomes really ugly when you try to expose low
level details on cgroups because of the implied flexibility (I can organize
however I want hierarchically and the controls must nest and be delegatable
properly). So, if you think the isolation feature will need lots of low level
knobs exposed, cgroup isn't the right place. It should be something simpler
and lower level.

This probably is a good time to spend some time thinking how it'd look like,
say, five years down the line. If it's gonna be the "I want isolation" knob +
maybe some obscure system wide knobs that most people don't need to think
about, it's gonna be fine. Otherwise, we shouldn't put this in cgroup until we
have better ideas on what the interface should look like.

Thanks.
On 5/30/22 09:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
>> That's actually my worry. If we start with an all in one ABI, how do we later
>> mix that up with more finegrained features? Like what will be the behaviour of:
>>
>> cpuset.isolation.rcu_nocb = 0
>> cpuset.isolation.all = 1
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

I am actually thinking about extending the cpuset partition interface for
isolation. Right now, I have an outstanding patch [1] to add an "isolated"
state to partition which disables load balancing, somewhat similar to the
isolcpus command line option. In the future, we can add an attribute to the
isolated state like "isolated:full", similar to nohz_full currently. If the
need arises, we can even extend the attribute to allow a list like
"isolated:rcu_nocbs".

I don't think it is a good idea to keep on adding new cpuset control files
extensively. I would prefer extending the existing ones.

[1] https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/

Cheers,
Longman
```diff
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..2d9f019bb590 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,6 +225,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_RCU_NOCB,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -268,6 +269,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_rcu_nocb(const struct cpuset *cs)
+{
+	return test_bit(CS_RCU_NOCB, &cs->flags);
+}
+
 static inline int is_partition_root(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
@@ -590,6 +596,62 @@ static inline void free_cpuset(struct cpuset *cs)
 	kfree(cs);
 }
 
+#ifdef CONFIG_RCU_NOCB_CPU
+static int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+	int err;
+
+	if (is_rcu_nocb(root))
+		err = housekeeping_cpumask_set(root->effective_cpus, HK_TYPE_RCU);
+	else
+		err = housekeeping_cpumask_clear(root->effective_cpus, HK_TYPE_RCU);
+
+	return err;
+}
+
+static int cpuset_rcu_nocb_update(struct cpuset *cur, struct cpuset *trialcs)
+{
+	struct cgroup_subsys_state *des_css;
+	struct cpuset *des;
+	int err;
+
+	if (cur->partition_root_state != PRS_ENABLED)
+		return -EINVAL;
+
+	err = cpuset_rcu_nocb_apply(trialcs);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(des, des_css, cur) {
+		if (des == cur)
+			continue;
+		if (des->partition_root_state == PRS_ENABLED)
+			break;
+		spin_lock_irq(&callback_lock);
+		if (is_rcu_nocb(trialcs))
+			set_bit(CS_RCU_NOCB, &des->flags);
+		else
+			clear_bit(CS_RCU_NOCB, &des->flags);
+		spin_unlock_irq(&callback_lock);
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+#else
+static inline int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+	return 0;
+}
+
+static inline int cpuset_rcu_nocb_update(struct cpuset *cur,
+					 struct cpuset *trialcs)
+{
+	return 0;
+}
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+
 /*
  * validate_change_legacy() - Validate conditions specific to legacy (v1)
  * behavior.
@@ -1655,6 +1717,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cs->partition_root_state) {
 		struct cpuset *parent = parent_cs(cs);
 
+		WARN_ON_ONCE(cpuset_rcu_nocb_apply(parent) < 0);
+		WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+
 		/*
 		 * For partition root, update the cpumasks of sibling
 		 * cpusets if they use parent's effective_cpus.
@@ -2012,6 +2077,12 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	if (is_rcu_nocb(cs) != is_rcu_nocb(trialcs)) {
+		err = cpuset_rcu_nocb_update(cs, trialcs);
+		if (err < 0)
+			goto out;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
@@ -2365,6 +2436,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_RCU_NOCB,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2406,6 +2478,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+	case FILE_RCU_NOCB:
+		retval = update_flag(CS_RCU_NOCB, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -2573,6 +2648,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_RCU_NOCB:
+		return is_rcu_nocb(cs);
 	default:
 		BUG();
 	}
@@ -2803,7 +2880,14 @@ static struct cftype dfl_files[] = {
 		.private = FILE_SUBPARTS_CPULIST,
 		.flags = CFTYPE_DEBUG,
 	},
-
+#ifdef CONFIG_RCU_NOCB_CPU
+	{
+		.name = "isolation.rcu_nocb",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_RCU_NOCB,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -2861,6 +2945,8 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	if (is_rcu_nocb(parent))
+		set_bit(CS_RCU_NOCB, &cs->flags);
 
 	cpuset_inc();
 
@@ -3227,12 +3313,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	if (mems_updated)
 		check_insane_mems_config(&new_mems);
 
-	if (is_in_v2_mode())
+	if (is_in_v2_mode()) {
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
-	else
+		if (cpus_updated)
+			WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+	} else {
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
+	}
 
 	percpu_up_write(&cpuset_rwsem);
 }
```
Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
directory which provides support for a set of CPUs to either enable ("1")
or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.

The file is only writeable on "root" type partitions to exclude any
overlap. The deepest root type partition has the highest priority.
This means that given the following setting:

               Top cpuset (CPUs: 0-7)
          cpuset.isolation.rcu_nocb = 0
                        |
                        |
            Subdirectory A (CPUs: 5-7)
          cpuset.cpus.partition = root
          cpuset.isolation.rcu_nocb = 0
                        |
                        |
             Subdirectory B (CPUs: 7)
          cpuset.cpus.partition = root
          cpuset.isolation.rcu_nocb = 1

the result is that only CPU 7 is in rcu_nocb mode.

Note that "rcu_nocbs" kernel parameter must be passed on boot, even
without a cpulist, so that nocb support is enabled.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Phil Auld <pauld@redhat.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)
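For illustration only, not part of the patch: a minimal userspace sketch of how the proposed file could be driven, assuming cgroup2 is mounted at /sys/fs/cgroup, that a child cpuset named "rt" (a hypothetical name) has already been created, and that the kernel was booted with the bare "rcu_nocbs" parameter as the changelog requires. cpuset.cpus and cpuset.cpus.partition are existing cpuset v2 files; cpuset.isolation.rcu_nocb is the file this patch adds.

```c
/*
 * Illustrative sketch (not from the patch): turn the hypothetical
 * /sys/fs/cgroup/rt cpuset into a root partition owning CPU 7 and
 * request RCU callback offloading through the proposed
 * cpuset.isolation.rcu_nocb file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return -1;
	}
	if (write(fd, val, strlen(val)) < 0) {
		perror(path);
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	const char *dir = "/sys/fs/cgroup/rt";	/* hypothetical cpuset */
	char path[256];

	/* Give the cpuset CPU 7 and turn it into a root partition. */
	snprintf(path, sizeof(path), "%s/cpuset.cpus", dir);
	if (write_str(path, "7"))
		return 1;
	snprintf(path, sizeof(path), "%s/cpuset.cpus.partition", dir);
	if (write_str(path, "root"))
		return 1;

	/* "1" requests RCU callback offloading for the partition's CPUs. */
	snprintf(path, sizeof(path), "%s/cpuset.isolation.rcu_nocb", dir);
	if (write_str(path, "1"))
		return 1;

	return 0;
}
```

Per the changelog, the last write is only accepted on a "root" type partition, and the deepest root partition's value wins for the CPUs it covers.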