[v3,0/2] cgroup: New misc cgroup controller

Message ID	20210304231946.2766648-1-vipinsh@google.com (mailing list archive)
Headers	show Return-Path: <kvm-owner@kernel.org> Sender: "vipinsh via sendgmr" <vipinsh@vipinsh.kir.corp.google.com> Date: Thu, 4 Mar 2021 15:19:44 -0800 Message-Id: <20210304231946.2766648-1-vipinsh@google.com> Mime-Version: 1.0 Subject: [Patch v3 0/2] cgroup: New misc cgroup controller From: Vipin Sharma <vipinsh@google.com> To: tj@kernel.org, mkoutny@suse.com, rdunlap@infradead.org, thomas.lendacky@amd.com, brijesh.singh@amd.com, jon.grimm@amd.com, eric.vantassell@amd.com, pbonzini@redhat.com, hannes@cmpxchg.org, frankja@linux.ibm.com, borntraeger@de.ibm.com Cc: corbet@lwn.net, seanjc@google.com, vkuznets@redhat.com, wanpengli@tencent.com, jmattson@google.com, joro@8bytes.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, hpa@zytor.com, gingell@google.com, rientjes@google.com, dionnaglaze@google.com, kvm@vger.kernel.org, x86@kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Vipin Sharma <vipinsh@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	cgroup: New misc cgroup controller \| expand [v3,0/2] cgroup: New misc cgroup controller [v3,1/2] cgroup: sev: Add misc cgroup controller [v3,2/2] cgroup: sev: Miscellaneous cgroup documentation.

Vipin Sharma March 4, 2021, 11:19 p.m. UTC

Hello

This patch series is creating a new misc cgroup controller for limiting
and tracking of resources which are not abstract like other cgroup
controllers.

This controller was initially proposed as encryption_id but after
the feedbacks, it is now changed to misc cgroup.
https://lore.kernel.org/lkml/20210108012846.4134815-2-vipinsh@google.com/

Changes in RFC v3:
1. Changed implementation to support 64 bit counters.
2. Print kernel logs only once per resource per cgroup.
3. Capacity can be set less than the current usage.

Changes in RFC v2:
1. Documentation fixes.
2. Added kernel log messages.
3. Changed charge API to treat misc_cg as input parameter.
4. Added helper APIs to get and release references on the cgroup.

[1] https://lore.kernel.org/lkml/20210218195549.1696769-1-vipinsh@google.com
[2] https://lore.kernel.org/lkml/20210302081705.1990283-1-vipinsh@google.com/

Vipin Sharma (2):
  cgroup: sev: Add misc cgroup controller
  cgroup: sev: Miscellaneous cgroup documentation.

 Documentation/admin-guide/cgroup-v1/index.rst |   1 +
 Documentation/admin-guide/cgroup-v1/misc.rst  |   4 +
 Documentation/admin-guide/cgroup-v2.rst       |  69 ++-
 arch/x86/kvm/svm/sev.c                        |  65 ++-
 arch/x86/kvm/svm/svm.h                        |   1 +
 include/linux/cgroup_subsys.h                 |   4 +
 include/linux/misc_cgroup.h                   | 130 ++++++
 init/Kconfig                                  |  14 +
 kernel/cgroup/Makefile                        |   1 +
 kernel/cgroup/misc.c                          | 402 ++++++++++++++++++
 10 files changed, 679 insertions(+), 12 deletions(-)
 create mode 100644 Documentation/admin-guide/cgroup-v1/misc.rst
 create mode 100644 include/linux/misc_cgroup.h
 create mode 100644 kernel/cgroup/misc.c

Tejun Heo March 7, 2021, 12:48 p.m. UTC | #1

Hello,

On Thu, Mar 04, 2021 at 03:19:44PM -0800, Vipin Sharma wrote:
> This patch series is creating a new misc cgroup controller for limiting
> and tracking of resources which are not abstract like other cgroup
> controllers.
> 
> This controller was initially proposed as encryption_id but after
> the feedbacks, it is now changed to misc cgroup.
> https://lore.kernel.org/lkml/20210108012846.4134815-2-vipinsh@google.com/

Vipin, thank you very much for your persistence and patience. The patchset
looks good to me. Michal, as you've been reviewing the series, can you
please take another look and ack them if you don't find anything
objectionable?

Michal Koutný March 11, 2021, 6:58 p.m. UTC | #2

Hello.

On Sun, Mar 07, 2021 at 07:48:40AM -0500, Tejun Heo <tj@kernel.org> wrote:
> Vipin, thank you very much for your persistence and patience.
Yes, and thanks for taking my remarks into account.

> Michal, as you've been reviewing the series, can you please take
> another look and ack them if you don't find anything objectionable?
Honestly, I'm still sitting on the fence whether this needs a new
controller and whether the miscontroller (:-p) is a good approach in the
long term [1].

I admit, I didn't follow the past dicussions completely, however,
(Vipin) could it be in the cover letter/commit messages shortly
summarized why cgroups and a controller were chosen to implement
restrictions of these resources, what were the alternatives any why were
they rejected?

In the previous discussion, I saw the reasoning for the list of the
resources to be hardwired in the controller itself in order to get some
scrutiny of possible changes. That makes sense to me. But with that, is
it necessary to commit to the new controller API via EXPORT_SYMBOL? (I
don't mean this as a licensing question but what the external API should
be (if any).)

Besides the generic remarks above, I'd still suggest some slight
implementation changes, posted inline to the patch.

Thanks,
Michal

[1] Currently, only one thing comes to my mind -- the delegation via
cgroup.subtree_control. The miscontroller may add possibly further
resources whose delegation granularity is bunched up under one entry.

Tejun Heo March 11, 2021, 7:39 p.m. UTC | #3

Hello,

On Thu, Mar 11, 2021 at 07:58:19PM +0100, Michal Koutný wrote:
> > Michal, as you've been reviewing the series, can you please take
> > another look and ack them if you don't find anything objectionable?
> Honestly, I'm still sitting on the fence whether this needs a new
> controller and whether the miscontroller (:-p) is a good approach in the
> long term [1].

Yeah, it's a bit of cop-out. My take is that the underlying hardware feature
isn't mature enough to have reasonable abstraction built on top of them.
Given time, maybe future iterations will get there or maybe it's a passing
fad and people will mostly forget about these.

In the meantime, keeping them out of cgroup is one direction, a relatively
high friction one but still viable. Or we can provide something of a halfway
house so that people who have immediate needs can still leverage the
existing infrastructure while controlling the amount of time, energy and
future lock-ins they take. So, that's misc controller.

I'm somewhat ambivalent but we've had multiple of these things popping up in
the past several years and containment seems to be a reasonable approach at
this point.

> [1] Currently, only one thing comes to my mind -- the delegation via
> cgroup.subtree_control. The miscontroller may add possibly further
> resources whose delegation granularity is bunched up under one entry.

Controller enabling and delegation in themselves aren't supposed to have
resource or security implications, so I don't think it's a practical
problem.

Thanks.

Vipin Sharma March 12, 2021, 5:49 p.m. UTC | #4

On Thu, Mar 11, 2021 at 07:58:19PM +0100, Michal Koutný wrote:
> I admit, I didn't follow the past dicussions completely, however,
> (Vipin) could it be in the cover letter/commit messages shortly
> summarized why cgroups and a controller were chosen to implement
> restrictions of these resources, what were the alternatives any why were
> they rejected?

I will add some more information in the cover letter of the next version.

Basically, SEV will mostly be used by cloud providers for providing
confidential VMs. Since they are limited we need a good way to schedule
these jobs in cloud infrastructure. To achieve this we either come up
with some ioctl for "/dev/sev" to know about its usage, availability,
etc. This requires existing scheduling mechanism in the cloud to have an
extension for this interaction. Now same thing needs to be done for TDX.
IBM SEID doesn't have scarcity of this resource but they are also
interested in tracking and limiting the usage. Each one coming up with
their own interaction is a duplicate effort when they all need similar
thing. One can say that abstraction should be at KVM level but these
resources can be used outside VM as well.

Most of the cloud infrastructure use cgroups for knowing the host state,
track the resources usage, enforce limits on them, etc. They use this
info to optimize work allocation in the fleet and make sure no rogue job
consumes more than it needs and starves other. Adding these resources
to cgroup is a natural choice with least friction. Cgroup itself says it
is a mechanism to distribute system resources along the hierarchy in a
controlled mechanism and configurable manner. Most of the resources in
cgroups are abstracted enough but their are still resources which are
not abstract but have limited availability or have specific use cases.

> 
> In the previous discussion, I saw the reasoning for the list of the
> resources to be hardwired in the controller itself in order to get some
> scrutiny of possible changes. That makes sense to me. But with that, is
> it necessary to commit to the new controller API via EXPORT_SYMBOL? (I
> don't mean this as a licensing question but what the external API should
> be (if any).)

As per my understanding this is the only for way for loadable modules
(kvm-amd in this case) to access Kernel APIs. Let me know if there is a
better way to do it.

> 
> Besides the generic remarks above, I'd still suggest some slight
> implementation changes, posted inline to the patch.

I will work on them.

I appreciate you guys taking out time and helping me out with this patch
series.

Thanks
Vipin

Michal Koutný March 15, 2021, 7:10 p.m. UTC | #5

On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma <vipinsh@google.com> wrote:
> I will add some more information in the cover letter of the next version.
Thanks.

> Each one coming up with their own interaction is a duplicate effort
> when they all need similar thing.
Could this be expressed as a new BPF hook (when allocating/freeing such
a resource unit)?

The decision could be made based on the configured limit or even some
other predicate.

(I saw this proposed already but I haven't seen some more reasoning
whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown
controllers, though it's still new user API.)


> As per my understanding this is the only for way for loadable modules
> (kvm-amd in this case) to access Kernel APIs. Let me know if there is a
> better way to do it.
I understood the symbols are exported for such modularized builds.
However, making them non-GPL exposes them to any out-of-tree modules,
although, the resource types are supposed to stay hardcoded in the misc
controller. So my point was to make them EXPORT_SYMBOL_GPL to mark
they're just a means of implementing the modularized builds and not an
API. (But they'd remain API for out-of-tree GPL modules anyway, so take
this reasoning of mine with a grain of salt.)

Michal

Vipin Sharma March 22, 2021, 6:24 p.m. UTC | #6

On Mon, Mar 15, 2021 at 08:10:09PM +0100, Michal Koutný wrote:
> On Fri, Mar 12, 2021 at 09:49:26AM -0800, Vipin Sharma <vipinsh@google.com> wrote:
> > I will add some more information in the cover letter of the next version.
> Thanks.
> 
> > Each one coming up with their own interaction is a duplicate effort
> > when they all need similar thing.
> Could this be expressed as a new BPF hook (when allocating/freeing such
> a resource unit)?
> 
> The decision could be made based on the configured limit or even some
> other predicate.
> 
> (I saw this proposed already but I haven't seen some more reasoning
> whether it's worse/better. IMO, BPF hooks are "cheaper" than full-blown
> controllers, though it's still new user API.)

I am not much knowledgeable with BPF, so, I might be wrong here.

There are couple of things which might not be addressed with BPF:
1. Which controller to use in v1 case? These are not abstract resources
   so in v1 where each controller have their own hierarchy it might not
   be easy to identify the best controller.

2. It seems to me that we won't be able to abstract out a single BPF
   program which can help with all of the resources types we are
   planning to use, again, because it is not an abstract type like
   network packets, and there will be different places in the source
   code to use these resources.

To me a cgroup tends to give much easier and well integrated solution when it
comes to scheduling and limiting a resource with existing tools in a
cloud infrastructure.

> 
> 
> > As per my understanding this is the only for way for loadable modules
> > (kvm-amd in this case) to access Kernel APIs. Let me know if there is a
> > better way to do it.
> I understood the symbols are exported for such modularized builds.
> However, making them non-GPL exposes them to any out-of-tree modules,
> although, the resource types are supposed to stay hardcoded in the misc
> controller. So my point was to make them EXPORT_SYMBOL_GPL to mark
> they're just a means of implementing the modularized builds and not an
> API. (But they'd remain API for out-of-tree GPL modules anyway, so take
> this reasoning of mine with a grain of salt.)
> 
I see, I will change it to GPL.

Thanks
Vipin

[v3,0/2] cgroup: New misc cgroup controller

Message

Comments