[RFC,4/4] cpuset: Support RCU-NOCB toggle on v2 root partitions

Message ID 20220525221055.1152307-5-frederic@kernel.org (mailing list archive)
State New, archived
Series rcu/cpuset: Control RCU_NOCB offloading through cpusets

Commit Message

Frederic Weisbecker May 25, 2022, 10:10 p.m. UTC
Introduce a new "isolation.rcu_nocb" file within cgroup2/cpuset
directories which allows RCU callback offloading (aka. RCU NOCB) to be
enabled ("1") or disabled ("0") for the CPUs of a cpuset. This can
override settings previously passed through the "rcu_nocbs=" kernel
parameter.

The file is only writable on "root" type partitions, to exclude any
overlap. The deepest root type partition has the highest priority.
This means that given the following setting:

                    Top cpuset (CPUs: 0-7)
                    cpuset.isolation.rcu_nocb = 0
                              |
                              |
                    Subdirectory A (CPUs: 5-7)
                    cpuset.cpus.partition = root
                    cpuset.isolation.rcu_nocb = 0
                              |
                              |
                    Subdirectory B (CPUs: 7)
                    cpuset.cpus.partition = root
                    cpuset.isolation.rcu_nocb = 1

the result is that only CPU 7 is in rcu_nocb mode.

Note that "rcu_nocbs" kernel parameter must be passed on boot, even
without a cpulist, so that nocb support is enabled.
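
As an illustration (assuming cgroup2 is mounted at /sys/fs/cgroup and the
"A"/"B" directories from the hierarchy above have already been created as
root partitions), toggling offloading for B's CPUs from a program could look
like this minimal sketch:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* Path assumed from the example hierarchy above */
          const char *path = "/sys/fs/cgroup/A/B/cpuset.isolation.rcu_nocb";
          int fd = open(path, O_WRONLY);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          /* "1" offloads B's CPUs (rcu_nocb), "0" de-offloads them */
          if (write(fd, "1", 1) != 1) {
                  perror("write");
                  close(fd);
                  return 1;
          }
          close(fd);
          return 0;
  }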

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Phil Auld <pauld@redhat.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Daniel Bristot de Oliveira <bristot@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 kernel/cgroup/cpuset.c | 95 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 92 insertions(+), 3 deletions(-)

Comments

Tejun Heo May 26, 2022, 6:21 p.m. UTC | #1
On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> directory which provides support for a set of CPUs to either enable ("1")
> or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
> 
> The file is only writeable on "root" type partitions to exclude any
> overlap. The deepest root type partition has the highest priority.
> This means that given the following setting:
> 
>                     Top cpuset (CPUs: 0-7)
>                     cpuset.isolation.rcu_nocb = 0
>                               |
>                               |
>                     Subdirectory A (CPUs: 5-7)
>                     cpuset.cpus.partition = root
>                     cpuset.isolation.rcu_nocb = 0
>                               |
>                               |
>                     Subdirectory B (CPUs: 7)
>                     cpuset.cpus.partition = root
>                     cpuset.isolation.rcu_nocb = 1
> 
> the result is that only CPU 7 is in rcu_nocb mode.
> 
> Note that "rcu_nocbs" kernel parameter must be passed on boot, even
> without a cpulist, so that nocb support is enabled.

Does it even make sense to make this hierarchical? What's wrong with a
cpumask under sys/ or proc/?

Thanks.
Frederic Weisbecker May 26, 2022, 10:51 p.m. UTC | #2
On Thu, May 26, 2022 at 08:21:13AM -1000, Tejun Heo wrote:
> On Thu, May 26, 2022 at 12:10:55AM +0200, Frederic Weisbecker wrote:
> > Introduce a new "isolation.rcu_nocb" file within a cgroup2/cpuset
> > directory which provides support for a set of CPUs to either enable ("1")
> > or disable ("0") RCU callbacks offloading (aka. RCU NOCB). This can
> > overwrite previous boot settings towards "rcu_nocbs=" kernel parameter.
> > 
> > The file is only writeable on "root" type partitions to exclude any
> > overlap. The deepest root type partition has the highest priority.
> > This means that given the following setting:
> > 
> >                     Top cpuset (CPUs: 0-7)
> >                     cpuset.isolation.rcu_nocb = 0
> >                               |
> >                               |
> >                     Subdirectory A (CPUs: 5-7)
> >                     cpuset.cpus.partition = root
> >                     cpuset.isolation.rcu_nocb = 0
> >                               |
> >                               |
> >                     Subdirectory B (CPUs: 7)
> >                     cpuset.cpus.partition = root
> >                     cpuset.isolation.rcu_nocb = 1
> > 
> > the result is that only CPU 7 is in rcu_nocb mode.
> > 
> > Note that "rcu_nocbs" kernel parameter must be passed on boot, even
> > without a cpulist, so that nocb support is enabled.
> 
> Does it even make sense to make this hierarchical? What's wrong with a
> cpumask under sys/ or proc/?

I'm usually told that cpuset is the current place where CPU attributes are
supposed to go. I personally don't mind /sys much either, even though cpusets
look like a more flexible way to partition CPUs with properties and task
placement altogether...
Tejun Heo May 26, 2022, 11:02 p.m. UTC | #3
On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
> > Does it even make sense to make this hierarchical? What's wrong with a
> > cpumask under sys/ or proc/?
> 
> I'm usually told that cpusets is the current place where CPU attributes are
> supposed to go. I personally don't mind much /sys either even though cpusets
> looks like a more flexible way to partition CPUs with properties and tasks
> placement altogether...

Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
time seeing anything hierarchical with this one. Somebody just has to know
which cpus are up for rcu processing and which aren't. Waiman, what do you
think?

Thanks.
Waiman Long May 27, 2022, 12:28 a.m. UTC | #4
On 5/26/22 19:02, Tejun Heo wrote:
> On Fri, May 27, 2022 at 12:51:41AM +0200, Frederic Weisbecker wrote:
>>> Does it even make sense to make this hierarchical? What's wrong with a
>>> cpumask under sys/ or proc/?
>> I'm usually told that cpusets is the current place where CPU attributes are
>> supposed to go. I personally don't mind much /sys either even though cpusets
>> looks like a more flexible way to partition CPUs with properties and tasks
>> placement altogether...
> Yeah, I mean, if it's hierarchical, it's the right place but I have a hard
> time seeing anything hierarchical with this one. Somebody just has to know
> which cpus are up for rcu processing and which aren't. Waiman, what do you
> think?

I am thinking along the lines that it will not be hierarchical. However,
cpuset can be useful if we want to have multiple isolated partitions
underneath the top cpuset with different isolation attributes, but no
sub-isolated partitions with sub-attributes underneath them. IOW, we
can only set them at the first level under top_cpuset. Will that be useful?

Cheers,
Longman
Tejun Heo May 27, 2022, 12:37 a.m. UTC | #5
On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> I am thinking along the line that it will not be hierarchical. However,
> cpuset can be useful if we want to have multiple isolated partitions
> underneath the top cpuset with different isolation attributes, but no more
> sub-isolated partition with sub-attributes underneath them. IOW, we can only
> set them at the first level under top_cpuset. Will that be useful?

At that point, I'd just prefer to have it under /proc or /sys.

Thanks.
Juri Lelli May 27, 2022, 8:30 a.m. UTC | #6
Hi,

On 26/05/22 14:37, Tejun Heo wrote:
> On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > I am thinking along the line that it will not be hierarchical. However,
> > cpuset can be useful if we want to have multiple isolated partitions
> > underneath the top cpuset with different isolation attributes, but no more
> > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > set them at the first level under top_cpuset. Will that be useful?
> 
> At that point, I'd just prefer to have it under /proc or /sys.

FWIW, I was under the impression that this would nicely fit alongside
other features towards implementing dynamic isolation of CPUs (say
https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
for example). Wouldn't it be awkward to have to poke different places to
achieve isolation at runtime?

Also, I wonder if a proc/sys interface might be problematic for certain
middleware that is substantially based on using cgroups. I'll try to ask
around. :)

Best,
Juri
Tejun Heo May 27, 2022, 8:45 a.m. UTC | #7
Hello,

On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> FWIW, I was under the impression that this would nicely fit along the
> side of other feaures towards implenting dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> for example). Wouldn't be awkward to have to poke different places to
> achieve isolation at runtime?

So, if it were just part of the isolated domain thing, it would make
sense, but as a separate flag which isn't hierarchical, it's weird to put it
there.

> Also, I wonder if a proc/sys interface might be problematic for certain
> middleware that is substantially based on using cgroups. I'll try to ask
> around. :)

There is a downside to making a feature a part of cpuset in that it makes
cgroup usage mandatory. This is fine for something which benefits from
hierarchical organization but it is weird to require building a cgroup
hierarchy for straightforward system-wide features.

Thanks.
Phil Auld May 27, 2022, 12:58 p.m. UTC | #8
Hi,

On Thu, May 26, 2022 at 10:45:00PM -1000 Tejun Heo wrote:
> Hello,
> 
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other feaures towards implenting dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > for example). Wouldn't be awkward to have to poke different places to
> > achieve isolation at runtime?
> 
> So, it were just being part of the isolated domain thing, it would make
> sense, but as a separate flag which isn't hierarchical, it's weird to put it
> there.

The way I see it is more that the "isolated domain thing" is one part of
this whole dynamic isolation thing and is just one flag among many (most
still on the drawing board, but planned). It may be that Waiman's "isolated" 
should be renamed "no_load_balance" or something.

Part of this is making cpu isolation more granular. 

> 
> > Also, I wonder if a proc/sys interface might be problematic for certain
> > middleware that is substantially based on using cgroups. I'll try to ask
> > around. :)
> 
> There is a downside to making a feature a part of cpuset in that it makes
> cgroup usage mandatory. This is fine for something which benefits from
> hierarchical organization but it is weird to require building cgroup
> hierarchy for straight-forward system-wide features.
> 

That ship may have sailed when SD_LOAD_BALANCE was removed, which is part
of what Waiman's feature addresses.  That is, now in order to get control
over the system-wide feature of which CPUs get scheduler load balanced you 
need to use cpusets. 


My 3 cents anyway  (inflation ;)


Cheers,
Phil


> Thanks.
> 
Peter Zijlstra May 28, 2022, 2:24 p.m. UTC | #9
On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> Hi,
> 
> On 26/05/22 14:37, Tejun Heo wrote:
> > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > I am thinking along the line that it will not be hierarchical. However,
> > > cpuset can be useful if we want to have multiple isolated partitions
> > > underneath the top cpuset with different isolation attributes, but no more
> > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > set them at the first level under top_cpuset. Will that be useful?
> > 
> > At that point, I'd just prefer to have it under /proc or /sys.
> 
> FWIW, I was under the impression that this would nicely fit along the
> side of other feaures towards implenting dynamic isolation of CPUs (say
> https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> for example). Wouldn't be awkward to have to poke different places to
> achieve isolation at runtime?

This, that's what I was thinking.

My main objection to the whole thing is that it's an RCU_NOCB specific
interface. *That* I think is daft.

I was thinking a partition would be able to designate a house-keeping
sub-partition/mask, but who cares about all the various different
housekeeping parties.
Frederic Weisbecker May 30, 2022, 12:40 a.m. UTC | #10
On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > Hi,
> > 
> > On 26/05/22 14:37, Tejun Heo wrote:
> > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > I am thinking along the line that it will not be hierarchical. However,
> > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > underneath the top cpuset with different isolation attributes, but no more
> > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > set them at the first level under top_cpuset. Will that be useful?
> > > 
> > > At that point, I'd just prefer to have it under /proc or /sys.
> > 
> > FWIW, I was under the impression that this would nicely fit along the
> > side of other feaures towards implenting dynamic isolation of CPUs (say
> > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > for example). Wouldn't be awkward to have to poke different places to
> > achieve isolation at runtime?
> 
> This, that's what I was thinking.
> 
> My main objection to the whole thing is that it's an RCU_NOCB specific
> interface. *That* I think is daft.
> 
> I was thinking a partition would be able to designate a house-keeping
> sub-partition/mask, but who cares about all the various different
> housekeeping parties.

It's time for the isolation users to step up here! I very rarely hear from them
and I just can't figure out by myself all the variants of uses for each of the
isolation features. Maybe some people are only interested in nocb for some
specific uses, or maybe it never makes sense without nohz_full and all the rest
of the isolation features. So for now I'm taking the very cautious path of
splitting the interface.

Thanks.
Peter Zijlstra May 30, 2022, 8:11 a.m. UTC | #11
On Mon, May 30, 2022 at 02:40:49AM +0200, Frederic Weisbecker wrote:
> On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > 
> > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > 
> > > FWIW, I was under the impression that this would nicely fit along the
> > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > > for example). Wouldn't be awkward to have to poke different places to
> > > achieve isolation at runtime?
> > 
> > This, that's what I was thinking.
> > 
> > My main objection to the whole thing is that it's an RCU_NOCB specific
> > interface. *That* I think is daft.
> > 
> > I was thinking a partition would be able to designate a house-keeping
> > sub-partition/mask, but who cares about all the various different
> > housekeeping parties.
> 
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.

This is ABI, you can't walk back on it. I would suggest starting with an
'all feature' isolation. Only if there's real demand for something more
fine-grained add that on top. Simple first etc.
Frederic Weisbecker May 30, 2022, 10:56 a.m. UTC | #12
On Mon, May 30, 2022 at 10:11:41AM +0200, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 02:40:49AM +0200, Frederic Weisbecker wrote:
> > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > > 
> > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > > 
> > > > FWIW, I was under the impression that this would nicely fit along the
> > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > > > for example). Wouldn't be awkward to have to poke different places to
> > > > achieve isolation at runtime?
> > > 
> > > This, that's what I was thinking.
> > > 
> > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > interface. *That* I think is daft.
> > > 
> > > I was thinking a partition would be able to designate a house-keeping
> > > sub-partition/mask, but who cares about all the various different
> > > housekeeping parties.
> > 
> > It's time for the isolation users to step up here! I very rarely hear from them
> > and I just can't figure out by myself all the variants of uses for each of the
> > isolation features. May be some people are only interested in nocb for some
> > specific uses, or may be it never makes sense without nohz full and all the rest
> > of the isolation features. So for now I take the very cautious path to split the
> > interface.
> 
> This is ABI, you can't walk back on it. I would suggest starting with an
> 'all feature' isolation. Only if there's real demand for something more
> fine-grained add that on top. Simple first etc.

That's actually my worry. If we start with an all-in-one ABI, how do we later
mix that up with more fine-grained features? Like what will be the behaviour of:

cpuset.isolation.rcu_nocb = 0
cpuset.isolation.all = 1
Peter Zijlstra May 30, 2022, 1:16 p.m. UTC | #13
On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:

> > This is ABI, you can't walk back on it. I would suggest starting with an
> > 'all feature' isolation. Only if there's real demand for something more
> > fine-grained add that on top. Simple first etc.
> 
> That's actually my worry. If we start with an all in one ABI, how do we later
> mix that up with more finegrained features? Like what will be the behaviour of:
> 
> cpuset.isolation.rcu_nocb = 0
> cpuset.isolation.all = 1

Well clearly that doesn't make sense. I was more thinking along the
lines of cgroup.subtree_control, where instead all features are enabled
by default.

But only if there's a real usecase, otherwise there's no point in
providing such knobs.
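
(For reference, cgroup.subtree_control is written with space-separated
controller names prefixed with "+" or "-", and controllers not listed keep
their current state, e.g.:

    cgroup.subtree_control = "+cpuset +cpu -io"

An isolation knob modelled on that, with every feature enabled unless
explicitly removed, is one hypothetical way finer-grained control could be
added later without growing one file per feature.)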
Juri Lelli May 30, 2022, 2:13 p.m. UTC | #14
On 30/05/22 15:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
> 
> > > This is ABI, you can't walk back on it. I would suggest starting with an
> > > 'all feature' isolation. Only if there's real demand for something more
> > > fine-grained add that on top. Simple first etc.
> > 
> > That's actually my worry. If we start with an all in one ABI, how do we later
> > mix that up with more finegrained features? Like what will be the behaviour of:
> > 
> > cpuset.isolation.rcu_nocb = 0
> > cpuset.isolation.all = 1
> 
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
> 
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

An all-features on/off knob + a house-keeping sub-partition/mask seems to be
what the isolation users I could reach so far (OCP mostly) would indeed like
to have in the future.
Nicolas Saenz Julienne May 30, 2022, 2:29 p.m. UTC | #15
On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > 
> > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > 
> > > FWIW, I was under the impression that this would nicely fit along the
> > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > > for example). Wouldn't be awkward to have to poke different places to
> > > achieve isolation at runtime?
> > 
> > This, that's what I was thinking.
> > 
> > My main objection to the whole thing is that it's an RCU_NOCB specific
> > interface. *That* I think is daft.
> > 
> > I was thinking a partition would be able to designate a house-keeping
> > sub-partition/mask, but who cares about all the various different
> > housekeeping parties.
> 
> It's time for the isolation users to step up here! I very rarely hear from them
> and I just can't figure out by myself all the variants of uses for each of the
> isolation features. May be some people are only interested in nocb for some
> specific uses, or may be it never makes sense without nohz full and all the rest
> of the isolation features. So for now I take the very cautious path to split the
> interface.

OK, my 2 cents. I personally deal with virtualisation setups that involve RT
and CPU isolation on both host and guests.

The main use-case ATM is running DPDK-like workloads. We want to achieve
latencies in the order of tens of microseconds, so it's essential to avoid
entering the kernel at all costs. So, no HW interrupts, sched tick, RCU
callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
Everything is deferred onto housekeeping CPUs or disabled.
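
For reference, the boot-time version of that kind of setup (CPU list arbitrary
here) usually looks something like:

    isolcpus=nohz,domain,managed_irq,2-7 nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1

all of which is frozen at boot today; this series would make at least the
rcu_nocbs part adjustable at runtime.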

Then we have setups that need to deal with HW on the host, exposed to the guest
through emulation or VirtIO. The same rules apply really, except for some IRQ
affinity tweaks and sched priority magic.

I find it hard to see how running RCU callbacks locally could be useful to any
latency-sensitive workload.

Frederic, out of curiosity, do you have a use-case in mind that might benefit
from nohz_full but not rcu_nocb? Maybe HPC?

Regards,
Nicolas
Paul E. McKenney May 30, 2022, 2:49 p.m. UTC | #16
On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > Hi,
> > > > 
> > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > > 
> > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > > 
> > > > FWIW, I was under the impression that this would nicely fit along the
> > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > > > for example). Wouldn't be awkward to have to poke different places to
> > > > achieve isolation at runtime?
> > > 
> > > This, that's what I was thinking.
> > > 
> > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > interface. *That* I think is daft.
> > > 
> > > I was thinking a partition would be able to designate a house-keeping
> > > sub-partition/mask, but who cares about all the various different
> > > housekeeping parties.
> > 
> > It's time for the isolation users to step up here! I very rarely hear from them
> > and I just can't figure out by myself all the variants of uses for each of the
> > isolation features. May be some people are only interested in nocb for some
> > specific uses, or may be it never makes sense without nohz full and all the rest
> > of the isolation features. So for now I take the very cautious path to split the
> > interface.
> 
> OK, my 2 cents. I personally deal with virtualisation setups that involve RT
> and CPU isolation on both host and guests.
> 
> The main use-case ATM is running DPDK-like workloads. We want to achieve
> latencies in the order of tens of microseconds, so it's essential to avoid
> entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
> callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
> Everything is deferred onto housekeeping CPUs or disabled.
> 
> Then we have setups that need to deal with HW on the host, exposed to the guest
> through emulation or VirtIO. The same rules apply really, except for some IRQ
> affinity tweaks and sched priority magic.
> 
> I find it hard to see how running RCU callback locally could be useful to any
> latency sensitive workload.
> 
> Frederic, out of curiosity, do you have a use-case in mind that might benefit
> from nohz_full but not rcu_nocb? Maybe HPC?

Would users looking for millisecond-scale latencies want rcu_nocbs but
not nohz_full, that is, the other way around?

							Thanx, Paul
Frederic Weisbecker May 30, 2022, 9:35 p.m. UTC | #17
On Mon, May 30, 2022 at 03:16:27PM +0200, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
> 
> > > This is ABI, you can't walk back on it. I would suggest starting with an
> > > 'all feature' isolation. Only if there's real demand for something more
> > > fine-grained add that on top. Simple first etc.
> > 
> > That's actually my worry. If we start with an all in one ABI, how do we later
> > mix that up with more finegrained features? Like what will be the behaviour of:
> > 
> > cpuset.isolation.rcu_nocb = 0
> > cpuset.isolation.all = 1
> 
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
> 
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

That makes sense. So there would be a simple cpuset.isolation that can
be either 1 or 0 where 1 has all possible isolation stuff on. Then
if the need arises we can provide more tuning through a new specific
cgroup controller, right?

If so that sounds good to me.
Alison Chaiken May 30, 2022, 10:36 p.m. UTC | #18
> On Mon, May 30, 2022 at 04:29:56PM +0200, nicolas saenz julienne wrote:
> > On Mon, 2022-05-30 at 02:40 +0200, Frederic Weisbecker wrote:
> > > On Sat, May 28, 2022 at 04:24:50PM +0200, Peter Zijlstra wrote:
> > > > On Fri, May 27, 2022 at 10:30:18AM +0200, Juri Lelli wrote:
> > > > > Hi,
> > > > >
> > > > > On 26/05/22 14:37, Tejun Heo wrote:
> > > > > > On Thu, May 26, 2022 at 08:28:43PM -0400, Waiman Long wrote:
> > > > > > > I am thinking along the line that it will not be hierarchical. However,
> > > > > > > cpuset can be useful if we want to have multiple isolated partitions
> > > > > > > underneath the top cpuset with different isolation attributes, but no more
> > > > > > > sub-isolated partition with sub-attributes underneath them. IOW, we can only
> > > > > > > set them at the first level under top_cpuset. Will that be useful?
> > > > > >
> > > > > > At that point, I'd just prefer to have it under /proc or /sys.
> > > > >
> > > > > FWIW, I was under the impression that this would nicely fit along the
> > > > > side of other feaures towards implenting dynamic isolation of CPUs (say
> > > > > https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/
> > > > > for example). Wouldn't be awkward to have to poke different places to
> > > > > achieve isolation at runtime?
> > > >
> > > > This, that's what I was thinking.
> > > >
> > > > My main objection to the whole thing is that it's an RCU_NOCB specific
> > > > interface. *That* I think is daft.
> > > >
> > > > I was thinking a partition would be able to designate a house-keeping
> > > > sub-partition/mask, but who cares about all the various different
> > > > housekeeping parties.
> > >
> > > It's time for the isolation users to step up here! I very rarely hear from them
> > > and I just can't figure out by myself all the variants of uses for each of the
> > > isolation features. May be some people are only interested in nocb for some
> > > specific uses, or may be it never makes sense without nohz full and all the rest
> > > of the isolation features. So for now I take the very cautious path to split the
> > > interface.
> >
> > OK, my 2 cents. I personally deal with virtualisation setups that involve RT
> > and CPU isolation on both host and guests.
> >
> > The main use-case ATM is running DPDK-like workloads. We want to achieve
> > latencies in the order of tens of microseconds, so it's essential to avoid
> > entering the kernel at all cost. So, no HW interrupts, sched tick, RCU
> > callbacks, clocksource watchdogs, softlockup, intel_pstate, timers, etc...
> > Everything is deferred onto housekeeping CPUs or disabled.
> >
> > Then we have setups that need to deal with HW on the host, exposed to the guest
> > through emulation or VirtIO. The same rules apply really, except for some IRQ
> > affinity tweaks and sched priority magic.
> >
> > I find it hard to see how running RCU callback locally could be useful to any
> > latency sensitive workload.
> >
> > Frederic, out of curiosity, do you have a use-case in mind that might benefit
> > from nohz_full but not rcu_nocb? Maybe HPC?

On Mon, May 30, 2022 at 8:42 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> Would users looking for millisecond-scale latencies want rcu_nocbs but
> not nohz_full, that is, the other way around?

On Intel processors running 5.15 with the timersd patches from Siewior
backported and Weisbecker's bug-fix, choosing CONFIG_HZ_PERIODIC
prevents cores from entering deeper C-states when hrtimers are
pending. With CONFIG_NO_HZ_COMMON, the cores do not service
non-blocking timer callbacks until another thread wakes them.  Since
low latency is critical, this is a use case where NO_HZ_FULL will not
work but rcu_nocbs is needed.

-- Alison Chaiken, Aurora Innovation
Tejun Heo May 31, 2022, 12:57 a.m. UTC | #19
On Mon, May 30, 2022 at 11:35:56PM +0200, Frederic Weisbecker wrote:
> That makes sense. So there would be a simple cpuset.isolation that can
> be either 1 or 0 where 1 has all possible isolation stuff on. Then
> if the need arises we can provide more tuning through a new specific
> cgroup controller, right?

Given that there isn't much that is hierarchical about them, I'm pretty
skeptical about introducing a new controller or fancy hierarchical interface
for it. If isolation is intertwined with cpuset partitioning and a simple
knob for it fits well with the rest of configuration, yeah, but let's please
try to avoid maximizing the interface. We want the interface to encode
users' intentions (e.g., here, I want these cpus isolated) not the
implementation details to make that happen. Of course, there are gradients
but it becomes really ugly when you try to expose low level details on
cgroups because of the implied flexibility (I can organize however I want
hierarchically and the controls must nest and be delegatable properly).

So, if you think the isolation feature will need lots of low-level knobs
exposed, cgroup isn't the right place. It should be something simpler and
lower level. This probably is a good time to spend some time thinking about
what it'd look like, say, five years down the line. If it's gonna be the "I want
isolation" knob + maybe some obscure system wide knobs that most people
don't need to think about, it's gonna be fine. Otherwise, we shouldn't put
this in cgroup until we have better ideas on what the interface should look
like.

Thanks.
Waiman Long May 31, 2022, 2:21 p.m. UTC | #20
On 5/30/22 09:16, Peter Zijlstra wrote:
> On Mon, May 30, 2022 at 12:56:50PM +0200, Frederic Weisbecker wrote:
>
>>> This is ABI, you can't walk back on it. I would suggest starting with an
>>> 'all feature' isolation. Only if there's real demand for something more
>>> fine-grained add that on top. Simple first etc.
>> That's actually my worry. If we start with an all in one ABI, how do we later
>> mix that up with more finegrained features? Like what will be the behaviour of:
>>
>> cpuset.isolation.rcu_nocb = 0
>> cpuset.isolation.all = 1
> Well clearly that doesn't make sense. I was more thinking along the
> lines of cgroup.subtree_control, where instead all features are enabled
> by default.
>
> But only if there's a real usecase, otherwise there's no point in
> providing such knobs.

I am actually thinking about extending the cpuset partition interface
for isolation. Right now, I have an outstanding patch [1] to add an
"isolated" state to partition which disables load balancing, somewhat
similar to the isolcpus command line option. In the future, we can add
attributes to the isolated state, like "isolated:full", similar to
nohz_full currently. If the need arises, we can even extend the
attribute to allow a list like "isolated:rcu_nocbs". I don't think it is
a good idea to keep on adding new cpuset control files extensively. I
would prefer extending the existing ones.
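
To sketch how that could look (only the plain "isolated" value exists in [1];
the suffixed forms are hypothetical future extensions):

    cpuset.cpus.partition = isolated             (no load balancing, as in [1])
    cpuset.cpus.partition = isolated:full        (hypothetical: all isolation features)
    cpuset.cpus.partition = isolated:rcu_nocbs   (hypothetical: a specific feature list)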

[1] https://lore.kernel.org/lkml/20220510153413.400020-1-longman@redhat.com/

Cheers,
Longman

Patch

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9390bfd9f1cd..2d9f019bb590 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -225,6 +225,7 @@  typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_RCU_NOCB,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -268,6 +269,11 @@  static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_rcu_nocb(const struct cpuset *cs)
+{
+	return test_bit(CS_RCU_NOCB, &cs->flags);
+}
+
 static inline int is_partition_root(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
@@ -590,6 +596,62 @@  static inline void free_cpuset(struct cpuset *cs)
 	kfree(cs);
 }
 
+#ifdef CONFIG_RCU_NOCB_CPU
+static int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+	int err;
+
+	if (is_rcu_nocb(root))
+		err = housekeeping_cpumask_set(root->effective_cpus, HK_TYPE_RCU);
+	else
+		err = housekeeping_cpumask_clear(root->effective_cpus, HK_TYPE_RCU);
+
+	return err;
+}
+
+static int cpuset_rcu_nocb_update(struct cpuset *cur, struct cpuset *trialcs)
+{
+	struct cgroup_subsys_state *des_css;
+	struct cpuset *des;
+	int err;
+
+	if (cur->partition_root_state != PRS_ENABLED)
+		return -EINVAL;
+
+	err = cpuset_rcu_nocb_apply(trialcs);
+	if (err < 0)
+		return err;
+
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(des, des_css, cur) {
+		if (des == cur)
+			continue;
+		if (des->partition_root_state == PRS_ENABLED)
+			break;
+		spin_lock_irq(&callback_lock);
+		if (is_rcu_nocb(trialcs))
+			set_bit(CS_RCU_NOCB, &des->flags);
+		else
+			clear_bit(CS_RCU_NOCB, &des->flags);
+		spin_unlock_irq(&callback_lock);
+	}
+	rcu_read_unlock();
+
+	return 0;
+}
+#else
+static inline int cpuset_rcu_nocb_apply(struct cpuset *root)
+{
+	return 0;
+}
+
+static inline int cpuset_rcu_nocb_update(struct cpuset *cur,
+					 struct cpuset *trialcs)
+{
+	return 0;
+}
+#endif /* #ifdef CONFIG_RCU_NOCB_CPU */
+
 /*
  * validate_change_legacy() - Validate conditions specific to legacy (v1)
  *                            behavior.
@@ -1655,6 +1717,9 @@  static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cs->partition_root_state) {
 		struct cpuset *parent = parent_cs(cs);
 
+		WARN_ON_ONCE(cpuset_rcu_nocb_apply(parent) < 0);
+		WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+
 		/*
 		 * For partition root, update the cpumasks of sibling
 		 * cpusets if they use parent's effective_cpus.
@@ -2012,6 +2077,12 @@  static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
 	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs)));
 
+	if (is_rcu_nocb(cs) != is_rcu_nocb(trialcs)) {
+		err = cpuset_rcu_nocb_update(cs, trialcs);
+		if (err < 0)
+			goto out;
+	}
+
 	spin_lock_irq(&callback_lock);
 	cs->flags = trialcs->flags;
 	spin_unlock_irq(&callback_lock);
@@ -2365,6 +2436,7 @@  typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_RCU_NOCB,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
@@ -2406,6 +2478,9 @@  static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+	case FILE_RCU_NOCB:
+		retval = update_flag(CS_RCU_NOCB, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -2573,6 +2648,8 @@  static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_RCU_NOCB:
+		return is_rcu_nocb(cs);
 	default:
 		BUG();
 	}
@@ -2803,7 +2880,14 @@  static struct cftype dfl_files[] = {
 		.private = FILE_SUBPARTS_CPULIST,
 		.flags = CFTYPE_DEBUG,
 	},
-
+#ifdef CONFIG_RCU_NOCB_CPU
+	{
+		.name = "isolation.rcu_nocb",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_RCU_NOCB,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -2861,6 +2945,8 @@  static int cpuset_css_online(struct cgroup_subsys_state *css)
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	if (is_rcu_nocb(parent))
+		set_bit(CS_RCU_NOCB, &cs->flags);
 
 	cpuset_inc();
 
@@ -3227,12 +3313,15 @@  static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 	if (mems_updated)
 		check_insane_mems_config(&new_mems);
 
-	if (is_in_v2_mode())
+	if (is_in_v2_mode()) {
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
-	else
+		if (cpus_updated)
+			WARN_ON_ONCE(cpuset_rcu_nocb_apply(cs) < 0);
+	} else {
 		hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
 					    cpus_updated, mems_updated);
+	}
 
 	percpu_up_write(&cpuset_rwsem);
 }