
[0/9] memcg accounting from OpenVZ

Message ID 5affff71-e503-9fb9-50cb-f6d48286dd52@virtuozzo.com
Series: memcg accounting from OpenVZ

Message

Vasily Averin March 9, 2021, 8:03 a.m. UTC
For many years OpenVZ has accounted the memory of several kernel objects;
this helps us to prevent host memory abuse from inside a memcg-limited container.

Vasily Averin (9):
  memcg: accounting for allocations called with disabled BH
  memcg: accounting for fib6_nodes cache
  memcg: accounting for ip6_dst_cache
  memcg: accounting for fib_rules
  memcg: accounting for ip_fib caches
  memcg: accounting for fasync_cache
  memcg: accounting for mnt_cache entries
  memcg: accounting for tty_struct objects
  memcg: accounting for ldt_struct objects

 arch/x86/kernel/ldt.c | 7 ++++---
 drivers/tty/tty_io.c  | 4 ++--
 fs/fcntl.c            | 3 ++-
 fs/namespace.c        | 5 +++--
 mm/memcontrol.c       | 2 +-
 net/core/fib_rules.c  | 4 ++--
 net/ipv4/fib_trie.c   | 4 ++--
 net/ipv6/ip6_fib.c    | 2 +-
 net/ipv6/route.c      | 2 +-
 9 files changed, 18 insertions(+), 15 deletions(-)
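
The change that repeats throughout the series is the memcg opt-in for the
affected allocations. A rough sketch of the two recurring forms (the slab-cache
hunk is taken from the fasync patch quoted later in this thread; the
GFP_KERNEL_ACCOUNT variant for the non-cache allocations is an assumption about
the ldt.c and tty hunks, which are not quoted here):

  /* Opt a whole slab cache in to memcg accounting: every object
   * allocated from it is charged to the allocating task's memcg. */
  fasync_cache = kmem_cache_create("fasync_cache",
                                   sizeof(struct fasync_struct), 0,
                                   SLAB_PANIC | SLAB_ACCOUNT, NULL);

  /* For one-off allocations, the accounted GFP flag plays the same role: */
  new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL_ACCOUNT);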

Comments

Shakeel Butt March 9, 2021, 9:12 p.m. UTC | #1
On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> For many years OpenVZ has accounted the memory of several kernel objects;
> this helps us to prevent host memory abuse from inside a memcg-limited container.
>

The text is cryptic but I am assuming you wanted to say that OpenVZ
has remained on a kernel which was still on opt-out kmem accounting
i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
patches are needed, right?
Vasily Averin March 10, 2021, 10:17 a.m. UTC | #2
On 3/10/21 12:12 AM, Shakeel Butt wrote:
> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> For many years OpenVZ has accounted the memory of several kernel objects;
>> this helps us to prevent host memory abuse from inside a memcg-limited container.
>>
> 
> The text is cryptic but I am assuming you wanted to say that OpenVZ
> has remained on a kernel which was still on opt-out kmem accounting
> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> patches are needed, right?

Something like this.
Frankly speaking, I do not fully understand which arguments I should provide to upstream
to enable accounting for some new kind of object.

OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
and we have accounted all required kernel objects by using our own patches.
When memcg was added to upstream, Vladimir Davydov added accounting for some of these objects
upstream, but skipped other ones.
Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
the "skipped" objects via our own patches, just because we accounted such objects before.
We're working on a rebase to newer kernels, and we would prefer to push our old patches upstream.

Thank you,
	Vasily Averin
Michal Hocko March 10, 2021, 10:41 a.m. UTC | #3
On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> > On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> For many years OpenVZ has accounted the memory of several kernel objects;
> >> this helps us to prevent host memory abuse from inside a memcg-limited container.
> >>
> > 
> > The text is cryptic but I am assuming you wanted to say that OpenVZ
> > has remained on a kernel which was still on opt-out kmem accounting
> > i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> > patches are needed, right?
> 
> Something like this.
> Frankly speaking, I do not fully understand which arguments I should provide to upstream
> to enable accounting for some new kind of object.
>
> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
> and we have accounted all required kernel objects by using our own patches.
> When memcg was added to upstream, Vladimir Davydov added accounting for some of these objects
> upstream, but skipped other ones.
> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> the "skipped" objects via our own patches, just because we accounted such objects before.
> We're working on a rebase to newer kernels, and we would prefer to push our old patches upstream.

That is certainly interesting information. But for a changelog it
would be more appropriate to provide information about how much memory
a user can induce and whether there is any way to limit that memory by
other means, how practical those other means are, and which use cases will
benefit from the containment.
Vasily Averin March 11, 2021, 7 a.m. UTC | #4
On 3/10/21 1:41 PM, Michal Hocko wrote:
> On Wed 10-03-21 13:17:19, Vasily Averin wrote:
>> On 3/10/21 12:12 AM, Shakeel Butt wrote:
>>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>>>
>>>> For many years OpenVZ has accounted the memory of several kernel objects;
>>>> this helps us to prevent host memory abuse from inside a memcg-limited container.
>>>
>>> The text is cryptic but I am assuming you wanted to say that OpenVZ
>>> has remained on a kernel which was still on opt-out kmem accounting
>>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
>>> patches are needed, right?
>>
>> Something like this.
>> Frankly speaking, I do not fully understand which arguments I should provide to upstream
>> to enable accounting for some new kind of object.
>>
>> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
>> and we have accounted all required kernel objects by using our own patches.
>> When memcg was added to upstream, Vladimir Davydov added accounting for some of these objects
>> upstream, but skipped other ones.
>> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
>> the "skipped" objects via our own patches, just because we accounted such objects before.
>> We're working on a rebase to newer kernels, and we would prefer to push our old patches upstream.
> 
> That is certainly interesting information. But for a changelog it
> would be more appropriate to provide information about how much memory
> a user can induce and whether there is any way to limit that memory by
> other means, how practical those other means are, and which use cases will
> benefit from the containment.

Right now I would like to understand how I should justify my requests for
accounting of new kinds of objects.

Which description is enough to enable object accounting?
Could you please specify some edge rules?
Should I push such patches through this list?
Is it better to send them to the mailing lists of the corresponding subsystems?
Should I notify them somehow, at least?

"untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory  shortage and  kill huge"

"each mount inside memcg-limited container creates non-accounted mount object,
 but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"

"unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"

"non-accounted multi-page tty objects can be created from inside memcg-limited container"

"unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"

Thank you,
	Vasily Averin
Michal Hocko March 11, 2021, 8:35 a.m. UTC | #5
On Thu 11-03-21 10:00:17, Vasily Averin wrote:
> On 3/10/21 1:41 PM, Michal Hocko wrote:
[...]
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which use cases will
> > benefit from the containment.
> 
> Right now I would like to understand how I should justify my requests for
> accounting of new kinds of objects.
> 
> Which description is enough to enable object accounting?

Doesn't the above paragraph give you a hint?

> Could you please specify some edge rules?

There are no strong rules AFAIK. I would say that what is important is
that the user can trigger a large or unbounded number of objects.

> Should I push such patches through this list?

Yes, linux-mm and CCing the memcg maintainers is the proper way. It would be
great to CC the maintainers of the affected subsystems as well.

> Is it better to send them to the mailing lists of the corresponding subsystems?

> Should I notify them somehow, at least?
> 
> "untrusted netadmin inside memcg-limited container can create unlimited number of routing entries, trigger OOM on host that will be unable to find the reason of memory  shortage and  kill huge"
> 
> "each mount inside memcg-limited container creates non-accounted mount object,
>  but new mount namespace creation consumes huge piece of non-accounted memory for cloned mounts"
> 
> "unprivileged user inside memcg-limited container can create non-accounted multi-page per-thread kernel objects for LDT"
> 
> "non-accounted multi-page tty objects can be created from inside memcg-limited container"
> 
> "unprivileged user inside memcg-limited container can trigger creation of huge number of non-accounted fasync_struct objects"

OK, that sounds better to me. It would also be great if you could mention
whether there are any other means to limit those objects.

Thanks!
Shakeel Butt March 11, 2021, 3:14 p.m. UTC | #6
On Wed, Mar 10, 2021 at 11:00 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 3/10/21 1:41 PM, Michal Hocko wrote:
> > On Wed 10-03-21 13:17:19, Vasily Averin wrote:
> >> On 3/10/21 12:12 AM, Shakeel Butt wrote:
> >>> On Tue, Mar 9, 2021 at 12:04 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>>>
> >>>> For many years OpenVZ has accounted the memory of several kernel objects;
> >>>> this helps us to prevent host memory abuse from inside a memcg-limited container.
> >>>
> >>> The text is cryptic but I am assuming you wanted to say that OpenVZ
> >>> has remained on a kernel which was still on opt-out kmem accounting
> >>> i.e. <4.5. Now OpenVZ wants to move to a newer kernel and thus these
> >>> patches are needed, right?
> >>
> >> Something like this.
> >> Frankly speaking, I do not fully understand which arguments I should provide to upstream
> >> to enable accounting for some new kind of object.
> >>
> >> OpenVZ has used its own accounting subsystem since 2001 (i.e. since the v2.2.x Linux kernels)
> >> and we have accounted all required kernel objects by using our own patches.
> >> When memcg was added to upstream, Vladimir Davydov added accounting for some of these objects
> >> upstream, but skipped other ones.
> >> Now OpenVZ uses RHEL7-based kernels with cgroup v1 in production, and we still account
> >> the "skipped" objects via our own patches, just because we accounted such objects before.
> >> We're working on a rebase to newer kernels, and we would prefer to push our old patches upstream.
> >
> > That is certainly interesting information. But for a changelog it
> > would be more appropriate to provide information about how much memory
> > a user can induce and whether there is any way to limit that memory by
> > other means, how practical those other means are, and which use cases will
> > benefit from the containment.
>
> Right now I would like to understand how I should justify my requests for
> accounting of new kinds of objects.
>
> Which description is enough to enable object accounting?
> Could you please specify some edge rules?
> Should I push such patches through this list?
> Is it better to send them to the mailing lists of the corresponding subsystems?
> Should I notify them somehow, at least?
>
> "an untrusted netadmin inside a memcg-limited container can create an unlimited number of routing entries, trigger an OOM on the host that will be unable to find the reason for the memory shortage, and kill huge processes"
>
> "each mount inside a memcg-limited container creates a non-accounted mount object,
>  but new mount namespace creation consumes a huge piece of non-accounted memory for the cloned mounts"
>
> "an unprivileged user inside a memcg-limited container can create non-accounted multi-page per-thread kernel objects for the LDT"
>
> "non-accounted multi-page tty objects can be created from inside a memcg-limited container"
>
> "an unprivileged user inside a memcg-limited container can trigger the creation of a huge number of non-accounted fasync_struct objects"
>

I think the above reasoning is good enough. Just resend your patches
with the corresponding details.
Borislav Petkov March 15, 2021, 1:27 p.m. UTC | #7
On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> An unprivileged user inside a memcg-limited container can create
> non-accounted multi-page per-thread kernel objects for the LDT

I have a hard time parsing this commit message.

And I'm CCed only on patch 8 of what looks like a patchset.

And that patchset is not on lkml so I can't find the rest to read about
it, perhaps linux-mm.

/me goes and finds it on lore

I can see some bits and pieces, this, for example:

https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/

 ( Btw, that version has your SOB and this patch doesn't even have a
   Signed-off-by. Next time, run your whole set through checkpatch please
   before sending. )

Now, this URL above talks about OOM, ok, that gets me close to the "why"
this patch.

From a quick look at the ldt.c code, we allow a single LDT struct per
mm. Manpage says so too:

DESCRIPTION
       modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
       The LDT is an array of segment descriptors that can be referenced by user  code.
       Linux  allows  processes  to configure a per-process (actually per-mm) LDT.

We allow

/* Maximum number of LDT entries supported. */
#define LDT_ENTRIES     8192

so there's an upper limit per mm.

Now, please explain what is this accounting for?

Thx.
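
For reference, the per-mm allocation being discussed can be driven from user
space along these lines (a hypothetical sketch, not taken from the thread;
writing a high entry_number makes the kernel size the entry array up to
LDT_ENTRIES * LDT_ENTRY_SIZE = 8192 * 8 bytes = 64 KiB, and each forked child
that repeats the call gets its own copy):

  #include <asm/ldt.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int main(void)
  {
      struct user_desc desc;

      memset(&desc, 0, sizeof(desc));
      desc.entry_number = 8191;     /* force a full-size entry array */
      desc.seg_not_present = 1;     /* an "empty" but accepted descriptor */
      desc.read_exec_only = 1;

      for (int i = 0; i < 512; i++) {
          if (fork() == 0) {
              /* write_ldt() allocates this child's multi-page ldt_struct */
              syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
              pause();
          }
      }
      pause();
      return 0;
  }
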
David Ahern March 15, 2021, 3:13 p.m. UTC | #8
On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt
> '!in_task()' check should be used instead
> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>
David Ahern March 15, 2021, 3:14 p.m. UTC | #9
On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for 'struct rt6_info' allocations.
> ---
>  net/ipv6/route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>
David Ahern March 15, 2021, 3:14 p.m. UTC | #10
On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for 'struct fib_rules'
> ---
>  net/core/fib_rules.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>
David Ahern March 15, 2021, 3:15 p.m. UTC | #11
On 3/15/21 6:23 AM, Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> This patch enables accounting for ip_fib_alias and ip_fib_trie caches
> ---
>  net/ipv4/fib_trie.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Acked-by: David Ahern <dsahern@kernel.org>
Shakeel Butt March 15, 2021, 3:23 p.m. UTC | #12
On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
>
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
>
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
>
> It this case is

'In this case it is'

> not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt

'context'

> '!in_task()' check should be used instead

You missed the signoff and it seems like the whole series is missing
it as well. Please run scripts/checkpatch.pl on the patches before
sending again.

> ---
>  mm/memcontrol.c    | 2 +-
>  net/ipv6/ip6_fib.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec0..568f2cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1076,7 +1076,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>                 return false;
>
>         /* Memcg to charge can't be determined. */
> -       if (in_interrupt() || !current->mm || (current->flags & PF_KTHREAD))
> +       if (!in_task() || !current->mm || (current->flags & PF_KTHREAD))

Can you please also add some explanation in the commit message on the
differences between in_interrupt() and in_task()? Why is
in_interrupt() not correct here but !in_task() is? What about kernels
with or without PREEMPT_COUNT?
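
For reference, the distinction hinges on which preempt_count bits each helper
tests; roughly (paraphrasing include/linux/preempt.h, exact masks vary by
kernel version and by CONFIG_PREEMPT_COUNT):

  #define in_interrupt()  (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))
  #define in_task()       (!(preempt_count() & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_OFFSET)))

  /*
   * local_bh_disable() -- and therefore write_lock_bh() -- adds
   * SOFTIRQ_DISABLE_OFFSET to preempt_count.  That sets bits inside
   * SOFTIRQ_MASK, so in_interrupt() becomes true even in process context,
   * while the SOFTIRQ_OFFSET bit that marks actual softirq execution stays
   * clear, so in_task() remains true and current can still be charged.
   */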

>                 return true;
>
>         return false;
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index ef9d022..fa92ed1 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -2445,7 +2445,7 @@ int __init fib6_init(void)
>
>         fib6_node_kmem = kmem_cache_create("fib6_nodes",
>                                            sizeof(struct fib6_node),
> -                                          0, SLAB_HWCACHE_ALIGN,
> +                                          0, SLAB_HWCACHE_ALIGN|SLAB_ACCOUNT,
>                                            NULL);
>         if (!fib6_node_kmem)
>                 goto out;
> --
> 1.8.3.1
>
Shakeel Butt March 15, 2021, 3:48 p.m. UTC | #13
On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > An unprivileged user inside a memcg-limited container can create
> > non-accounted multi-page per-thread kernel objects for the LDT
>
> I have a hard time parsing this commit message.
>
> And I'm CCed only on patch 8 of what looks like a patchset.
>
> And that patchset is not on lkml so I can't find the rest to read about
> it, perhaps linux-mm.
>
> /me goes and finds it on lore
>
> I can see some bits and pieces, this, for example:
>
> https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
>
>  ( Btw, that version has your SOB and this patch doesn't even have a
>    Signed-off-by. Next time, run your whole set through checkpatch please
>    before sending. )
>
> Now, this URL above talks about OOM, ok, that gets me close to the "why"
> this patch.
>
> From a quick look at the ldt.c code, we allow a single LDT struct per
> mm. Manpage says so too:
>
> DESCRIPTION
>        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
>        The LDT is an array of segment descriptors that can be referenced by user  code.
>        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
>
> We allow
>
> /* Maximum number of LDT entries supported. */
> #define LDT_ENTRIES     8192
>
> so there's an upper limit per mm.
>
> Now, please explain what is this accounting for?
>

Let me try to provide the reasoning at least from my perspective.
There are legitimate workloads with hundreds of processes and there
can be hundreds of workloads running on large machines. The
unaccounted memory can cause isolation issues between the workloads
particularly on highly utilized machines.
Shakeel Butt March 15, 2021, 3:56 p.m. UTC | #14
On Mon, Mar 15, 2021 at 5:23 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> An unprivileged user inside a memcg-limited container can trigger the
> creation of a huge number of non-accounted fasync_struct objects

You need to make each patch of this series self-contained by including
the motivation behind the series (just one or two sentences). For
example, for this patch, state the potential impact of these
huge numbers of unaccounted fasync_struct objects.

> ---
>  fs/fcntl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index dfc72f1..7941559 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
>                         __FMODE_EXEC | __FMODE_NONOTIFY));
>
>         fasync_cache = kmem_cache_create("fasync_cache",
> -               sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
> +                                        sizeof(struct fasync_struct), 0,
> +                                        SLAB_PANIC | SLAB_ACCOUNT, NULL);
>         return 0;
>  }
>
> --
> 1.8.3.1
>
Michal Hocko March 15, 2021, 3:58 p.m. UTC | #15
On Mon 15-03-21 08:48:26, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 6:27 AM Borislav Petkov <bp@alien8.de> wrote:
> >
> > On Mon, Mar 15, 2021 at 03:24:01PM +0300, Vasily Averin wrote:
> > > An unprivileged user inside a memcg-limited container can create
> > > non-accounted multi-page per-thread kernel objects for the LDT
> >
> > I have a hard time parsing this commit message.
> >
> > And I'm CCed only on patch 8 of what looks like a patchset.
> >
> > And that patchset is not on lkml so I can't find the rest to read about
> > it, perhaps linux-mm.
> >
> > /me goes and finds it on lore
> >
> > I can see some bits and pieces, this, for example:
> >
> > https://lore.kernel.org/linux-mm/05c448c7-d992-8d80-b423-b80bf5446d7c@virtuozzo.com/
> >
> >  ( Btw, that version has your SOB and this patch doesn't even have a
> >    Signed-off-by. Next time, run your whole set through checkpatch please
> >    before sending. )
> >
> > Now, this URL above talks about OOM, ok, that gets me close to the "why"
> > this patch.
> >
> > From a quick look at the ldt.c code, we allow a single LDT struct per
> > mm. Manpage says so too:
> >
> > DESCRIPTION
> >        modify_ldt()  reads  or  writes  the local descriptor table (LDT) for a process.
> >        The LDT is an array of segment descriptors that can be referenced by user  code.
> >        Linux  allows  processes  to configure a per-process (actually per-mm) LDT.
> >
> > We allow
> >
> > /* Maximum number of LDT entries supported. */
> > #define LDT_ENTRIES     8192
> >
> > so there's an upper limit per mm.
> >
> > Now, please explain what is this accounting for?
> >
> 
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

It would be better to be explicit:

8192 entries * 8 bytes = 64kB per mm, i.e. 64kB * number_of_tasks

so realistically this is in the range of a few megabytes. Is this worth the
memcg accounting overhead? Maybe yes, but what kind of workloads really
care?
Borislav Petkov March 15, 2021, 3:59 p.m. UTC | #16
On Mon, Mar 15, 2021 at 08:48:26AM -0700, Shakeel Butt wrote:
> Let me try to provide the reasoning at least from my perspective.
> There are legitimate workloads with hundreds of processes and there
> can be hundreds of workloads running on large machines. The
> unaccounted memory can cause isolation issues between the workloads
> particularly on highly utilized machines.

Good enough for me, as long as that is part of the commit message.

Thx.
Jakub Kicinski March 15, 2021, 5:09 p.m. UTC | #17
On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> An untrusted netadmin inside a memcg-limited container can create a
> huge number of routing entries. Currently, allocated kernel objects
> are not accounted to the proper memcg, so this can lead to a global memory
> shortage on the host and cause a lot of OOM kills.
> 
> One such object is the 'struct fib6_node' mostly allocated in
> net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> 
>  write_lock_bh(&table->tb6_lock);
>  err = fib6_add(&table->tb6_root, rt, info, mxc);
>  write_unlock_bh(&table->tb6_lock);
> 
> It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> kmem cache. The proper memory cgroup still cannot be found due to the
> incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> To be sure that caller is not executed in process contxt
> '!in_task()' check should be used instead

Sorry for a random question, I didn't get the cover letter. 

What's the overhead of adding SLAB_ACCOUNT?

Please make sure you CC netdev on series which may impact networking.
Shakeel Butt March 15, 2021, 7:24 p.m. UTC | #18
On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > An untrusted netadmin inside a memcg-limited container can create a
> > huge number of routing entries. Currently, allocated kernel objects
> > are not accounted to the proper memcg, so this can lead to a global memory
> > shortage on the host and cause a lot of OOM kills.
> >
> > One such object is the 'struct fib6_node' mostly allocated in
> > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> >
> >  write_lock_bh(&table->tb6_lock);
> >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> >  write_unlock_bh(&table->tb6_lock);
> >
> > It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> > kmem cache. The proper memory cgroup still cannot be found due to the
> > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > To be sure that caller is not executed in process contxt
> > '!in_task()' check should be used instead
>
> Sorry for a random question, I didn't get the cover letter.
>
> What's the overhead of adding SLAB_ACCOUNT?
>

The potential overhead is for MEMCG users where we need to
charge/account each allocation from SLAB_ACCOUNT kmem caches. However
charging is done in batches, so the cost is amortized. If there is a
concern about a specific workload then it would be good to see the
impact of this patch for that workload.

> Please make sure you CC netdev on series which may impact networking.
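
The batching Shakeel refers to is a per-CPU stock of pre-charged bytes; a
simplified paraphrase of the fast path in mm/memcontrol.c (shape only; field
and function names vary across kernel versions):

  /* Most SLAB_ACCOUNT allocations are satisfied from a per-CPU cache of
   * pre-charged bytes, so the expensive page_counter updates happen only
   * once per page, not once per object. */
  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
  {
      struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);

      if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
          stock->nr_bytes -= nr_bytes;  /* fast path: no atomic updates */
          return true;
      }
      return false;  /* slow path charges a full page and refills the stock */
  }
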
Roman Gushchin March 15, 2021, 7:32 p.m. UTC | #19
On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 15 Mar 2021 15:23:00 +0300 Vasily Averin wrote:
> > > An untrusted netadmin inside a memcg-limited container can create a
> > > huge number of routing entries. Currently, allocated kernel objects
> > > are not accounted to the proper memcg, so this can lead to a global memory
> > > shortage on the host and cause a lot of OOM kills.
> > >
> > > One such object is the 'struct fib6_node' mostly allocated in
> > > net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section:
> > >
> > >  write_lock_bh(&table->tb6_lock);
> > >  err = fib6_add(&table->tb6_root, rt, info, mxc);
> > >  write_unlock_bh(&table->tb6_lock);
> > >
> > > It this case is not enough to simply add SLAB_ACCOUNT to corresponding
> > > kmem cache. The proper memory cgroup still cannot be found due to the
> > > incorrect 'in_interrupt()' check used in memcg_kmem_bypass().
> > > To be sure that caller is not executed in process contxt
> > > '!in_task()' check should be used instead
> >
> > Sorry for a random question, I didn't get the cover letter.
> >
> > What's the overhead of adding SLAB_ACCOUNT?
> >
> 
> The potential overhead is for MEMCG users where we need to
> charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> charging is done in batches, so the cost is amortized. If there is a
> concern about a specific workload then it would be good to see the
> impact of this patch for that workload.
> 
> > Please make sure you CC netdev on series which may impact networking.

In general the overhead is not that big, so I don't think we should argue
too much about every new case where we want to enable the accounting, but
rather focus on those few examples (if any) where it actually hurts
the performance in a meaningful way.

Thanks!
Jakub Kicinski March 15, 2021, 7:35 p.m. UTC | #20
On Mon, 15 Mar 2021 12:32:07 -0700 Roman Gushchin wrote:
> On Mon, Mar 15, 2021 at 12:24:31PM -0700, Shakeel Butt wrote:
> > On Mon, Mar 15, 2021 at 10:09 AM Jakub Kicinski <kuba@kernel.org> wrote:  
> > > Sorry for a random question, I didn't get the cover letter.
> > >
> > > What's the overhead of adding SLAB_ACCOUNT?
> > 
> > The potential overhead is for MEMCG users where we need to
> > charge/account each allocation from SLAB_ACCOUNT kmem caches. However
> > charging is done in batches, so the cost is amortized. If there is a
> > concern about a specific workload then it would be good to see the
> > impact of this patch for that workload.
> >   
> > > Please make sure you CC netdev on series which may impact networking.  
> 
> In general the overhead is not that big, so I don't think we should argue
> too much about every new case where we want to enable the accounting, but
> rather focus on those few examples (if any) where it actually hurts
> the performance in a meaningful way.

Ack, no serious concerns about this particular case.

I was expecting you'd have micro benchmark numbers handy so I was
curious to learn what they are, but that appears not to be the case.