[v8,0/6] cgroup/cpuset: Add new cpuset partition type & empty effective cpus

Message ID 20211018143619.205065-1-longman@redhat.com (mailing list archive)

Message

Waiman Long Oct. 18, 2021, 2:36 p.m. UTC
v8:
 - Reorganize the patch series and rationalize the features and
   constraints of a partition.
 - Update patch descriptions and documentation accordingly.

v7:
 - Simplify the documentation patch (patch 5) as suggested by Tejun.
 - Fix a typo in patch 2 and improper commit log in patch 3.

v6:
 - Remove duplicated tmpmask from update_prstate() which should fix the
   frame size too large problem reported by kernel test robot.

This patchset makes four enhancements to the cpuset v2 code.

 Patch 1: Allow a partition with no tasks to have an empty
 cpuset.cpus.effective.

 Patch 2: Refine the features and constraints of a cpuset partition,
 clarifying what changes are allowed.

 Patch 3: Add a new partition state "isolated" to create a partition
 root without load balancing. This is for handling intermittent workloads
 that have a strict low latency requirement.

 Patch 4: Enable the "cpuset.cpus.partition" file to show the reason
 a partition is invalid, e.g. "root invalid (No cpu available
 due to hotplug)".

Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
cpuset test to test the new cpuset partition code.

Waiman Long (6):
  cgroup/cpuset: Allow no-task partition to have empty
    cpuset.cpus.effective
  cgroup/cpuset: Refining features and constraints of a partition
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Update description of cpuset.cpus.partition in
    cgroup-v2.rst
  kselftest/cgroup: Add cpuset v2 partition root state test

 Documentation/admin-guide/cgroup-v2.rst       | 153 ++--
 kernel/cgroup/cpuset.c                        | 393 +++++++----
 tools/testing/selftests/cgroup/Makefile       |   5 +-
 .../selftests/cgroup/test_cpuset_prs.sh       | 664 ++++++++++++++++++
 tools/testing/selftests/cgroup/wait_inotify.c |  87 +++
 5 files changed, 1115 insertions(+), 187 deletions(-)
 create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
 create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

Comments

Waiman Long Oct. 27, 2021, 11:05 p.m. UTC | #1
On 10/18/21 10:36 AM, Waiman Long wrote:
> [...]

Any feedback on this patch series?

Thanks,
Longman
MOESSBAUER, Felix Nov. 10, 2021, 11:13 a.m. UTC | #2
Hi Waiman,

> [...]
>
>  Patch 3: Add a new partition state "isolated" to create a partition
>  root without load balancing. This is for handling intermitten workloads
>  that have a strict low latency requirement.


I just tested this patch series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMPT_RT).

However, I was not able to see any latency improvements when using
cpuset.cpus.partition=isolated.
The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
On the other cpus, stress-ng was executed to generate load.

Just some more general notes:

Even with this new "isolated" type, it is still very tricky to get behavior
similar to isolcpus (unless I am missing something here):

Consider an RT application that consists of a non-RT thread that should float
and an RT thread that should be placed in the isolated domain.
This requires cgroup.type=threaded on both cgroups and changes to the application
(threads have to be created in the non-RT group and then moved to the RT group).

Theoretically, this could be done externally, but if the application sets the
affinity mask manually, you run into a timing issue (setting affinities to CPUs
outside the current cpuset.cpus results in EINVAL).

Best regards,
Felix Moessbauer
Siemens AG

Marcelo Tosatti Nov. 10, 2021, 1:21 p.m. UTC | #3
On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer wrote:
> [...]
>
> I just tested this patch-series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMT_RT).
> 
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.
> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.

enum hk_flags {
        HK_FLAG_TIMER           = 1,
        HK_FLAG_RCU             = (1 << 1),
        HK_FLAG_MISC            = (1 << 2),
        HK_FLAG_SCHED           = (1 << 3),
        HK_FLAG_TICK            = (1 << 4),
        HK_FLAG_DOMAIN          = (1 << 5),
        HK_FLAG_WQ              = (1 << 6),
        HK_FLAG_MANAGED_IRQ     = (1 << 7),
        HK_FLAG_KTHREAD         = (1 << 8),
};

static int __init housekeeping_nohz_full_setup(char *str)
{
        unsigned int flags;

        flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
                HK_FLAG_MISC | HK_FLAG_KTHREAD;

        return housekeeping_setup(str, flags);
}
__setup("nohz_full=", housekeeping_nohz_full_setup);

So HK_FLAG_SCHED and HK_FLAG_MANAGED_IRQ are unset in your configuration.
Perhaps they are affecting your latency numbers?
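
To make the point concrete, here is a minimal sketch that reuses the flag
values quoted above and computes which housekeeping flags a nohz_full= boot
parameter leaves untouched. The helpers nohz_full_unset_flags() and
format_hk_flags() are made up for this illustration and do not exist in the
kernel; the result only reflects the code as quoted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Flag values copied from the enum quoted above. */
enum hk_flags {
    HK_FLAG_TIMER       = 1,
    HK_FLAG_RCU         = (1 << 1),
    HK_FLAG_MISC        = (1 << 2),
    HK_FLAG_SCHED       = (1 << 3),
    HK_FLAG_TICK        = (1 << 4),
    HK_FLAG_DOMAIN      = (1 << 5),
    HK_FLAG_WQ          = (1 << 6),
    HK_FLAG_MANAGED_IRQ = (1 << 7),
    HK_FLAG_KTHREAD     = (1 << 8),
};

/* All flags defined in the enum above (bits 0..8). */
#define HK_FLAG_ALL ((unsigned int)((HK_FLAG_KTHREAD << 1) - 1))

/* Flags that nohz_full= passes to housekeeping_setup(), as quoted above. */
unsigned int nohz_full_flags(void)
{
    return HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
           HK_FLAG_MISC | HK_FLAG_KTHREAD;
}

/* Hypothetical helper: flags that nohz_full= does not set. */
unsigned int nohz_full_unset_flags(void)
{
    return HK_FLAG_ALL & ~nohz_full_flags();
}

/* Hypothetical helper: write the names of the set bits into buf. */
void format_hk_flags(unsigned int mask, char *buf, size_t len)
{
    static const char *const names[] = {
        "TIMER", "RCU", "MISC", "SCHED", "TICK",
        "DOMAIN", "WQ", "MANAGED_IRQ", "KTHREAD",
    };
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t bit = 0; bit < sizeof(names) / sizeof(names[0]); bit++) {
        if (!(mask & (1u << bit)))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", names[bit]);
        if (n < 0 || (size_t)n >= len - used)
            break; /* stop once the buffer is full */
        used += (size_t)n;
    }
}
```

Formatting the result of nohz_full_unset_flags() gives SCHED, DOMAIN and
MANAGED_IRQ, so with these definitions HK_FLAG_DOMAIN is also left unset by
nohz_full=, in addition to the two flags named above.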

This tool might be handy for finding the source of the latency:

https://github.com/xzpeter/rt-trace-bpf

./rt-trace-bcc.py -c isolated-cpu

Michal Koutný Nov. 10, 2021, 1:56 p.m. UTC | #4
Hello.

On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer <felix.moessbauer@siemens.com> wrote:
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.

Interesting. What was the baseline against which you compared it
(isolcpus, no cpusets,...)?

> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.
> [...]

> This requires cgroup.type=threaded on both cgroups and changes to the application
> (threads have to be born in non-rt group and moved to rt-group).

But even with isolcpus the application would need to set the affinity of
threads to the selected CPUs (cf. cgroup migration). Am I missing anything?

Thanks,
Michal
Waiman Long Nov. 10, 2021, 3:20 p.m. UTC | #5
On 11/10/21 06:13, Felix Moessbauer wrote:
> [...]
>
> I just tested this patch-series and can confirm that it works on 5.15.0-rc7-rt15 (PREEMT_RT).
>
> However, I was not able to see any latency improvements when using
> cpuset.cpus.partition=isolated.
> The test was performed with jitterdebugger on CPUs 1-3 and the following cmdline:
> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> On the other cpus, stress-ng was executed to generate load.
>
> Just some more general notes:
>
> Even with this new "isolated" type, it is still very tricky to get a similar
> behavior as with isolcpus (as long as I don't miss something here):
>
> Consider an RT application that consists of a non-rt thread that should be floating
> and a rt-thread that should be placed in the isolated domain.
> This requires cgroup.type=threaded on both cgroups and changes to the application
> (threads have to be born in non-rt group and moved to rt-group).
>
> Theoretically, this could be done externally, but in case the application sets the
> affinity mask manually, you run into a timing issue (setting affinities to CPUs
> outside the current cpuset.cpus results in EINVAL).

I believe the "isolated" type will have more benefit on non-PREEMPT_RT
kernels. Anyway, having the "isolated" type is just the first step. It
should be equivalent to "isolcpus=domain". There are other patches
floating around that attempt to move some of the isolcpus=nohz features into
cpuset as well. It is not there yet, but we should be able to have
better dynamic CPU isolation down the road.

Cheers,
Longman
MOESSBAUER, Felix Nov. 10, 2021, 3:21 p.m. UTC | #6
> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
> <felix.moessbauer@siemens.com> wrote:
> > However, I was not able to see any latency improvements when using
> > cpuset.cpus.partition=isolated.
> 
> Interesting. What was the baseline against which you compared it (isolcpus, no
> cpusets,...)?

For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).

> 
> > The test was performed with jitterdebugger on CPUs 1-3 and the following
> cmdline:
> > rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
> > On the other cpus, stress-ng was executed to generate load.
> > [...]
> 
> > This requires cgroup.type=threaded on both cgroups and changes to the
> > application (threads have to be born in non-rt group and moved to rt-group).
> 
> But even with isolcpus the application would need to set affinity of threads to
> the selected CPUs (cf cgroup migrating). Do I miss anything?

Yes, that's true. But there are two differences (given that you use isolcpus):
1. The application only has to set the affinity for RT threads.
   Threads that do not explicitly set their affinity are automatically excluded from the isolated cores.
   Even common RT test applications like jitterdebugger do not pin their non-RT threads.
2. Threads can be started on non-RT CPUs and then bound to a specific RT CPU.
   This binding can be specified before thread creation via pthread_create.
   That way, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinity mask.

With cgroup2, you cannot guarantee the second aspect, as creating a thread and moving it to a cgroup is not an atomic operation.
Also (please correct me if I'm wrong), you first have to create a thread before moving it into a group.
At creation time, you cannot set the final affinity mask (as you create it in the non-RT group, where the CPU is not in cpuset.cpus).
Once you move the thread to the RT cgroup, it has the default mask and can therefore be executed on the other RT cores.
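
For illustration, a minimal sketch of what point 2 looks like in code,
assuming glibc's non-portable pthread_attr_setaffinity_np(). The helpers
first_allowed_cpu() and run_pinned() are hypothetical names for this example
only. The thread is pinned to a CPU taken from the caller's current affinity
mask so that the sketch also runs on a machine without isolated CPUs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Thread body: record the CPU this thread is currently running on. */
static void *report_cpu(void *arg)
{
    *(int *)arg = sched_getcpu();
    return NULL;
}

/* Hypothetical helper: lowest CPU in the caller's affinity mask, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/*
 * Hypothetical helper: create a thread whose affinity is specified through
 * the attribute object, i.e. before pthread_create() is called.
 * Returns 0 on success or an errno-style error code.
 */
int run_pinned(int cpu, int *observed_cpu)
{
    pthread_attr_t attr;
    pthread_t tid;
    cpu_set_t set;
    int err;

    assert(observed_cpu != NULL);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    err = pthread_attr_init(&attr);
    if (err)
        return err;
    err = pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    if (!err)
        err = pthread_create(&tid, &attr, report_cpu, observed_cpu);
    pthread_attr_destroy(&attr);
    if (err)
        return err;
    return pthread_join(tid, NULL);
}
```

Calling run_pinned(first_allowed_cpu(), &seen) should leave the chosen CPU
number in seen. With a cpuset-based setup, the same call would be expected to
fail if the requested CPU is not in the creating thread's cpuset.cpus, which
is the limitation described above.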

Best regards,
Felix

Marcelo Tosatti Nov. 10, 2021, 4:10 p.m. UTC | #7
On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
> [...]
>
> Yes, that's true. But there are two differences (given that you use isolcpus):
> 1. the application only has to set the affinity for rt threads.
>  Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
>  Even common rt test applications like jitterdebugger do not pin their non-rt threads.
> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> This binding can be specified before thread creation via pthread_create.
> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
> 
> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.

man clone3:

       CLONE_NEWCGROUP (since Linux 4.6)
              Create the process in a new cgroup namespace. If this flag is not set, then (as with fork(2)) the
              process is created in the same cgroup namespaces as the calling process.

              For further information on cgroup namespaces, see cgroup_namespaces(7).

              Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
Marcelo Tosatti Nov. 10, 2021, 4:14 p.m. UTC | #8
On Wed, Nov 10, 2021 at 01:10:20PM -0300, Marcelo Tosatti wrote:
> [...]
>
> man clone3:
> 
>        CLONE_NEWCGROUP (since Linux 4.6)
>               Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>               process is created in the same cgroup namespaces as the calling process.
> 
>               For further information on cgroup namespaces, see cgroup_namespaces(7).
> 
>               Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
> 

Err, CLONE_INTO_CGROUP.
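
A rough sketch of how CLONE_INTO_CGROUP might be used, assuming Linux 5.7 or
later and kernel headers that provide struct clone_args with its cgroup
field. The helpers clone_into_cgroup_args() and fork_into_cgroup() are
hypothetical names. This creates a new process in the target cgroup; whether
the same approach covers the threaded case discussed here is an open
question, not something this sketch settles.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>        /* SIGCHLD */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>   /* SYS_clone3 */
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build clone3() arguments that place the child
 * directly into the cgroup v2 directory referred to by cgroup_fd
 * (a file descriptor opened on that directory, e.g. with O_RDONLY).
 */
struct clone_args clone_into_cgroup_args(int cgroup_fd)
{
    struct clone_args args;

    assert(cgroup_fd >= 0);
    memset(&args, 0, sizeof(args));
    args.flags = CLONE_INTO_CGROUP;
    args.exit_signal = SIGCHLD;         /* behave like fork() on exit */
    args.cgroup = (uint64_t)cgroup_fd;
    return args;
}

/*
 * Hypothetical helper: fork-like clone3(). Returns the child's PID in the
 * parent, 0 in the child, and -1 with errno set on failure.
 */
pid_t fork_into_cgroup(int cgroup_fd)
{
    struct clone_args args = clone_into_cgroup_args(cgroup_fd);

    return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

The child starts its life in the target cgroup, so there is no window in
which it runs in the parent's cgroup before being migrated.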
Jan Kiszka Nov. 10, 2021, 4:15 p.m. UTC | #9
On 10.11.21 17:10, Marcelo Tosatti wrote:
> [...]
> 
> man clone3:
> 
>        CLONE_NEWCGROUP (since Linux 4.6)
>               Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>               process is created in the same cgroup namespaces as the calling process.
> 
>               For further information on cgroup namespaces, see cgroup_namespaces(7).
> 
>               Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
> 

Is there pthread_attr_setcgroup_np()?

Jan
Marcelo Tosatti Nov. 10, 2021, 5:29 p.m. UTC | #10
On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka wrote:
> [...]
> 
> Is there pthread_attr_setcgroup_np()?
> 
> Jan

Don't know... Waiman?
Michal Koutný Nov. 10, 2021, 5:52 p.m. UTC | #11
On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> Is there pthread_attr_setcgroup_np()?

If I'm not mistaken, the 'p' in pthreads stands for POSIX and cgroups are
Linux-specific, so you won't find that (unless you implement it
yourself). ¯\_(ツ)_/¯

Michal
Jan Kiszka Nov. 10, 2021, 6:04 p.m. UTC | #12
On 10.11.21 18:52, Michal Koutný wrote:
> On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka <jan.kiszka@siemens.com> wrote:
>> Is there pthread_attr_setcgroup_np()?
> 
> If I'm not mistaken the 'p' in pthreads stands for POSIX and cgroups are
> Linux specific so you won't find that (unless you implement that
> yourself). ¯\_(ツ)_/¯
> 

I know what it stands for :). But I don't want to re-implement pthreads
just to have a single creation-time configurable injected. Neither would
developers of standard applications, e.g. libvirt for the rt-kvm special
case, when most of their use cases are fine with the regular pthread APIs. I
think there is also a demand for a programming model that fits into
existing ones.

Jan
Michal Koutný Nov. 10, 2021, 6:15 p.m. UTC | #13
On Wed, Nov 10, 2021 at 03:21:54PM +0000, "Moessbauer, Felix" <felix.moessbauer@siemens.com> wrote:
> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
> This binding can be specified before thread creation via pthread_create.
> By that, you can make sure that at no point in time a thread has a
> "forbidden" CPU in its affinities.

It should boil down to some clone$version(2) and sched_setaffinity(2)
calls, so strictly speaking even with pthread_create(3) the thread is
briefly running with the parent's affinity.
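
For illustration, a minimal sketch of the creation-time binding under
discussion, assuming glibc's pthread_attr_setaffinity_np(3). The helper
names spawn_pinned() and count_allowed_cpus() are hypothetical and only
serve this example. The mask is handed over before pthread_create(3) is
called, but per the remark above it still ends up as a sched_setaffinity(2)
call following the clone.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical helper: start fn in a new thread whose affinity attribute
 * is set to the single CPU `cpu` before pthread_create() is called.
 * Returns 0 on success or the error number from the pthread calls. */
int spawn_pinned(pthread_t *tid, int cpu, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    cpu_set_t set;
    int ret;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    ret = pthread_attr_init(&attr);
    if (ret != 0)
        return ret;

    /* Attach the mask to the attribute object, then create the thread. */
    ret = pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
    if (ret == 0)
        ret = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return ret;
}

/* Hypothetical thread body: store the number of CPUs in the calling
 * thread's affinity mask into the int that `out` points to. */
void *count_allowed_cpus(void *out)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) == 0)
        *(int *)out = CPU_COUNT(&set);
    return NULL;
}
```

By the time the start routine runs, the mask it reads back should contain
only the requested CPU, provided that CPU is allowed for the caller.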

> With cgroup2, you cannot guarantee the second aspect, as thread
> creation and moving to a cgroup is not an atomic operation.

As suggested by others, CLONE_INTO_CGROUP (into cpuset cgroup) can
actually "hide" the migration into the clone3() call.
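
A rough sketch of what that could look like for a plain child process,
assuming Linux 5.7 or later and kernel headers that provide the cgroup
member of struct clone_args. fork_into_cgroup() is a hypothetical helper
name and the cgroup path is supplied by the caller. Creating a thread this
way would additionally need CLONE_VM | CLONE_SIGHAND | CLONE_THREAD and a
caller-provided stack, which this sketch does not attempt.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/sched.h>   /* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fork a child directly into the cgroup v2 directory
 * cgroup_path. Returns the child's PID in the parent, 0 in the child, or
 * -1 on error with errno set. */
pid_t fork_into_cgroup(const char *cgroup_path)
{
    struct clone_args args;
    int cgfd, saved_errno;
    pid_t pid;

    assert(cgroup_path != NULL);

    /* CLONE_INTO_CGROUP takes a file descriptor of the cgroup directory. */
    cgfd = open(cgroup_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (cgfd < 0)
        return -1;

    memset(&args, 0, sizeof(args));
    args.flags = CLONE_INTO_CGROUP;
    args.exit_signal = SIGCHLD;
    args.cgroup = (uint64_t)cgfd;

    /* The child starts its life in the target cgroup; no later migration. */
    pid = (pid_t)syscall(SYS_clone3, &args, sizeof(args));

    saved_errno = errno;
    close(cgfd);
    errno = saved_errno;
    return pid;
}
```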

> At creation time, you cannot set the final affinity mask (as you
> create it in the non-rt group and there the CPU is not in the
> cpuset.cpus).
> Once you move the thread to the rt cgroup, it has a default mask and
> by that can be executed on other rt cores.

Good point. Perhaps you could work around this by having another level
of (non-root partition) cpuset cgroups for individual CPUs? (Maybe
there's a more clever approach; this is just the first that came to mind.)

Michal
Waiman Long Nov. 10, 2021, 6:30 p.m. UTC | #14
On 11/10/21 12:29, Marcelo Tosatti wrote:
> On Wed, Nov 10, 2021 at 05:15:41PM +0100, Jan Kiszka wrote:
>> On 10.11.21 17:10, Marcelo Tosatti wrote:
>>> On Wed, Nov 10, 2021 at 03:21:54PM +0000, Moessbauer, Felix wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Michal Koutný <mkoutny@suse.com>
>>>>> Sent: Wednesday, November 10, 2021 2:57 PM
>>>>> To: Moessbauer, Felix (T RDA IOT SES-DE) <felix.moessbauer@siemens.com>
>>>>> Cc: longman@redhat.com; akpm@linux-foundation.org;
>>>>> cgroups@vger.kernel.org; corbet@lwn.net; frederic@kernel.org; guro@fb.com;
>>>>> hannes@cmpxchg.org; juri.lelli@redhat.com; linux-doc@vger.kernel.org; linux-
>>>>> kernel@vger.kernel.org; linux-kselftest@vger.kernel.org;
>>>>> lizefan.x@bytedance.com; mtosatti@redhat.com; pauld@redhat.com;
>>>>> peterz@infradead.org; shuah@kernel.org; tj@kernel.org; Kiszka, Jan (T RDA
>>>>> IOT) <jan.kiszka@siemens.com>; Schild, Henning (T RDA IOT SES-DE)
>>>>> <henning.schild@siemens.com>
>>>>> Subject: Re: [PATCH v8 0/6] cgroup/cpuset: Add new cpuset partition type &
>>>>> empty effecitve cpus
>>>>>
>>>>> Hello.
>>>>>
>>>>> On Wed, Nov 10, 2021 at 12:13:57PM +0100, Felix Moessbauer
>>>>> <felix.moessbauer@siemens.com> wrote:
>>>>>> However, I was not able to see any latency improvements when using
>>>>>> cpuset.cpus.partition=isolated.
>>>>> Interesting. What was the baseline against which you compared it (isolcpus, no
>>>>> cpusets,...)?
>>>> For this test, I just compared both settings cpuset.cpus.partition=isolated|root.
>>>> There, I did not see a significant difference (but I know, RT tuning depends on a ton of things).
>>>>
>>>>>> The test was performed with jitterdebugger on CPUs 1-3 and the following
>>>>> cmdline:
>>>>>> rcu_nocbs=1-4 nohz_full=1-4 irqaffinity=0,5-6,11 intel_pstate=disable
>>>>>> On the other cpus, stress-ng was executed to generate load.
>>>>>> [...]
>>>>>> This requires cgroup.type=threaded on both cgroups and changes to the
>>>>>> application (threads have to be born in non-rt group and moved to rt-group).
>>>>> But even with isolcpus the application would need to set affinity of threads to
>>>>> the selected CPUs (cf cgroup migrating). Do I miss anything?
>>>> Yes, that's true. But there are two differences (given that you use isolcpus):
>>>> 1. the application only has to set the affinity for rt threads.
>>>>   Threads that do not explicitly set the affinity are automatically excluded from the isolated cores.
>>>>   Even common rt test applications like jitterdebugger do not pin their non-rt threads.
>>>> 2. Threads can be started on non-rt CPUs and then bound to a specific rt CPU.
>>>> This binding can be specified before thread creation via pthread_create.
>>>> By that, you can make sure that at no point in time a thread has a "forbidden" CPU in its affinities.
>>>>
>>>> With cgroup2, you cannot guarantee the second aspect, as thread creation and moving to a cgroup is not an atomic operation.
>>>> Also - please correct me if I'm wrong - you first have to create a thread before moving it into a group.
>>>> At creation time, you cannot set the final affinity mask (as you create it in the non-rt group and there the CPU is not in the cpuset.cpus).
>>>> Once you move the thread to the rt cgroup, it has a default mask and by that can be executed on other rt cores.
>>> man clone3:
>>>
>>>         CLONE_NEWCGROUP (since Linux 4.6)
>>>                Create  the  process  in  a  new cgroup namespace.  If this flag is not set, then (as with fork(2)) the
>>>                process is created in the same cgroup namespaces as the calling process.
>>>
>>>                For further information on cgroup namespaces, see cgroup_namespaces(7).
>>>
>>>                Only a privileged process (CAP_SYS_ADMIN) can employ CLONE_NEWCGROUP.
>>>
>> Is there pthread_attr_setcgroup_np()?
>>
>> Jan
> Don't know... Waiman?

I don't think there is such a libpthread call yet.

-Longman