[RFC] net: memcg accounting for veth devices

Message ID	a5e09e93-106d-0527-5b1e-48dbf3b48b4e@virtuozzo.com (mailing list archive)
State	RFC
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <netdev-owner@kernel.org> Message-ID: <a5e09e93-106d-0527-5b1e-48dbf3b48b4e@virtuozzo.com> Date: Mon, 28 Feb 2022 10:17:16 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.5.0 From: Vasily Averin <vvs@virtuozzo.com> Subject: [PATCH RFC] net: memcg accounting for veth devices To: Roman Gushchin <roman.gushchin@linux.dev>, Linux MM <linux-mm@kvack.org> Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Tejun Heo <tj@kernel.org>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Eric Dumazet <edumazet@google.com>, Luis Chamberlain <mcgrof@kernel.org>, Kees Cook <keescook@chromium.org>, Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>, David Ahern <dsahern@kernel.org>, linux-kernel@vger.kernel.org, kernel@openvz.org Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit MIME-Version: 1.0 Precedence: bulk
Series	[RFC] net: memcg accounting for veth devices \| expand [RFC] net: memcg accounting for veth devices

Context	Check	Description
netdev/tree_selection	success	Not a local patch

Vasily Averin Feb. 28, 2022, 7:17 a.m. UTC

Following one-liner running inside memcg-limited container consumes
huge number of host memory and can trigger global OOM.

for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done

Patch accounts most part of these allocations and can protect host.
---[cut]---
It is not polished, and perhaps should be splitted.
obviously it affects other kind of netdevices too.
Unfortunately I'm not sure that I will have enough time to handle it properly
and decided to publish current patch version as is.
OpenVz workaround it by using per-container limit for number of
available netdevices, but upstream does not have any kind of
per-container configuration.
------

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
  drivers/net/veth.c    | 2 +-
  fs/kernfs/mount.c     | 2 +-
  fs/proc/proc_sysctl.c | 3 ++-
  net/core/neighbour.c  | 4 ++--
  net/ipv4/devinet.c    | 2 +-
  net/ipv6/addrconf.c   | 6 +++---
  6 files changed, 10 insertions(+), 9 deletions(-)

Luis Chamberlain Feb. 28, 2022, 2:36 p.m. UTC | #1

On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> Following one-liner running inside memcg-limited container consumes
> huge number of host memory and can trigger global OOM.
> 
> for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> 
> Patch accounts most part of these allocations and can protect host.
> ---[cut]---
> It is not polished, and perhaps should be splitted.
> obviously it affects other kind of netdevices too.
> Unfortunately I'm not sure that I will have enough time to handle it properly
> and decided to publish current patch version as is.
> OpenVz workaround it by using per-container limit for number of
> available netdevices, but upstream does not have any kind of
> per-container configuration.
> ------

Should this just be a new ucount limit on kernel/ucount.c and have veth
use something like inc_ucount(current_user_ns(), current_euid(), UCOUNT_VETH)?

This might be abusing ucounts though, not sure, Eric?

  Luis
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  drivers/net/veth.c    | 2 +-
>  fs/kernfs/mount.c     | 2 +-
>  fs/proc/proc_sysctl.c | 3 ++-
>  net/core/neighbour.c  | 4 ++--
>  net/ipv4/devinet.c    | 2 +-
>  net/ipv6/addrconf.c   | 6 +++---
>  6 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 354a963075c5..6e0b4a9d0843 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1307,7 +1307,7 @@ static int veth_alloc_queues(struct net_device *dev)
>  	struct veth_priv *priv = netdev_priv(dev);
>  	int i;
> -	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
> +	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
>  	if (!priv->rq)
>  		return -ENOMEM;
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index cfa79715fc1a..2881aeeaa880 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -391,7 +391,7 @@ void __init kernfs_init(void)
>  {
>  	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
>  					      sizeof(struct kernfs_node),
> -					      0, SLAB_PANIC, NULL);
> +					      0, SLAB_PANIC | SLAB_ACCOUNT, NULL);
>  	/* Creates slab cache for kernfs inode attributes */
>  	kernfs_iattrs_cache  = kmem_cache_create("kernfs_iattrs_cache",
> diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
> index 7d9cfc730bd4..e20ce8198a44 100644
> --- a/fs/proc/proc_sysctl.c
> +++ b/fs/proc/proc_sysctl.c
> @@ -1333,7 +1333,8 @@ struct ctl_table_header *__register_sysctl_table(
>  		nr_entries++;
>  	header = kzalloc(sizeof(struct ctl_table_header) +
> -			 sizeof(struct ctl_node)*nr_entries, GFP_KERNEL);
> +			 sizeof(struct ctl_node)*nr_entries,
> +			 GFP_KERNEL_ACCOUNT);
>  	if (!header)
>  		return NULL;
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index ec0bf737b076..66a4445421f1 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1665,7 +1665,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
>  	struct net *net = dev_net(dev);
>  	const struct net_device_ops *ops = dev->netdev_ops;
> -	p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL);
> +	p = kmemdup(&tbl->parms, sizeof(*p), GFP_KERNEL_ACCOUNT);
>  	if (p) {
>  		p->tbl		  = tbl;
>  		refcount_set(&p->refcnt, 1);
> @@ -3728,7 +3728,7 @@ int neigh_sysctl_register(struct net_device *dev, struct neigh_parms *p,
>  	char neigh_path[ sizeof("net//neigh/") + IFNAMSIZ + IFNAMSIZ ];
>  	char *p_name;
> -	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL);
> +	t = kmemdup(&neigh_sysctl_template, sizeof(*t), GFP_KERNEL_ACCOUNT);
>  	if (!t)
>  		goto err;
> diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
> index fba2bffd65f7..47523fe5b891 100644
> --- a/net/ipv4/devinet.c
> +++ b/net/ipv4/devinet.c
> @@ -2566,7 +2566,7 @@ static int __devinet_sysctl_register(struct net *net, char *dev_name,
>  	struct devinet_sysctl_table *t;
>  	char path[sizeof("net/ipv4/conf/") + IFNAMSIZ];
> -	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL);
> +	t = kmemdup(&devinet_sysctl, sizeof(*t), GFP_KERNEL_ACCOUNT);
>  	if (!t)
>  		goto out;
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index f927c199a93c..9d903342bc41 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -358,7 +358,7 @@ static int snmp6_alloc_dev(struct inet6_dev *idev)
>  	if (!idev->stats.icmpv6dev)
>  		goto err_icmp;
>  	idev->stats.icmpv6msgdev = kzalloc(sizeof(struct icmpv6msg_mib_device),
> -					   GFP_KERNEL);
> +					   GFP_KERNEL_ACCOUNT);
>  	if (!idev->stats.icmpv6msgdev)
>  		goto err_icmpmsg;
> @@ -382,7 +382,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
>  	if (dev->mtu < IPV6_MIN_MTU)
>  		return ERR_PTR(-EINVAL);
> -	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
> +	ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL_ACCOUNT);
>  	if (!ndev)
>  		return ERR_PTR(err);
> @@ -7023,7 +7023,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
>  	struct ctl_table *table;
>  	char path[sizeof("net/ipv6/conf/") + IFNAMSIZ];
> -	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL);
> +	table = kmemdup(addrconf_sysctl, sizeof(addrconf_sysctl), GFP_KERNEL_ACCOUNT);
>  	if (!table)
>  		goto out;
> -- 
> 2.25.1
>

Shakeel Butt March 1, 2022, 6:09 p.m. UTC | #2

On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> > Following one-liner running inside memcg-limited container consumes
> > huge number of host memory and can trigger global OOM.
> >
> > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >
> > Patch accounts most part of these allocations and can protect host.
> > ---[cut]---
> > It is not polished, and perhaps should be splitted.
> > obviously it affects other kind of netdevices too.
> > Unfortunately I'm not sure that I will have enough time to handle it  
> properly
> > and decided to publish current patch version as is.
> > OpenVz workaround it by using per-container limit for number of
> > available netdevices, but upstream does not have any kind of
> > per-container configuration.
> > ------

> Should this just be a new ucount limit on kernel/ucount.c and have veth
> use something like inc_ucount(current_user_ns(), current_euid(),  
> UCOUNT_VETH)?

> This might be abusing ucounts though, not sure, Eric?

For admins of systems running multiple workloads, there is no easy way
to set such limits for each workload. Some may genuinely need more veth
than others. From admin's perspective it is preferred to have minimal
knobs to set and if these objects are charged to memcg then the memcg
limits would limit them. There was similar situation for inotify
instances where fs sysctl inotify/max_user_instances already limits the
inotify instances but we memcg charged them to not worry about setting
such limits. See ac7b79fd190b ("inotify, memcg: account inotify
instances to kmemcg").

Luis Chamberlain March 1, 2022, 6:28 p.m. UTC | #3

On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> > > Following one-liner running inside memcg-limited container consumes
> > > huge number of host memory and can trigger global OOM.
> > >
> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> > >
> > > Patch accounts most part of these allocations and can protect host.
> > > ---[cut]---
> > > It is not polished, and perhaps should be splitted.
> > > obviously it affects other kind of netdevices too.
> > > Unfortunately I'm not sure that I will have enough time to handle it
> > properly
> > > and decided to publish current patch version as is.
> > > OpenVz workaround it by using per-container limit for number of
> > > available netdevices, but upstream does not have any kind of
> > > per-container configuration.
> > > ------
> 
> > Should this just be a new ucount limit on kernel/ucount.c and have veth
> > use something like inc_ucount(current_user_ns(), current_euid(),
> > UCOUNT_VETH)?
> 
> > This might be abusing ucounts though, not sure, Eric?
> 
> 
> For admins of systems running multiple workloads, there is no easy way
> to set such limits for each workload.

That's why defaults would exist. Today's ulimits IMHO are insane and
some are arbitrarily large.

> Some may genuinely need more veth
> than others.

So why not make it a high sensible but not enough to OOM a typical system?

But again, I'd like to hear whether or not a ulimit for veth is a mi-use
of ulimits or if its the right place. If it's not then perhaps the
driver can just have its own atomic max definition.

> From admin's perspective it is preferred to have minimal
> knobs to set and if these objects are charged to memcg then the memcg
> limits would limit them. There was similar situation for inotify
> instances where fs sysctl inotify/max_user_instances already limits the
> inotify instances but we memcg charged them to not worry about setting
> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
> instances to kmemcg").

Yes but we want sensible defaults out of the box. What those should be
IMHO might be work which needs to be figured out well.

IMHO today's ulimits are a bit over the top today. This is off slightly
off topic but for instance play with:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make -j 8
echo 0 > /proc/sys/vm/oom_dump_tasks                                            
i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000;  sleep 1; let i=$i+1; done

If you see:

[  217.798124] cgroup: fork rejected by pids controller in
/user.slice/user-1000.slice/session-1.scope
                                                                                
Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:

[Slice]                                                                         
TasksMax=MAX_TASKS|infinity

Even though we have max_threads set to 61343, things ulimits have a
different limit set, and what this means is the above can end up easily
creating over 1048576 (17 times max_threads) threads all eagerly doing
nothing to just exit, essentially allowing a sort of fork bomb on exit.
Your system may or not fall to its knees.

  Luis

Eric W. Biederman March 1, 2022, 8:50 p.m. UTC | #4

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
>> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > > Following one-liner running inside memcg-limited container consumes
>> > > huge number of host memory and can trigger global OOM.
>> > >
>> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
>> > >
>> > > Patch accounts most part of these allocations and can protect host.
>> > > ---[cut]---
>> > > It is not polished, and perhaps should be splitted.
>> > > obviously it affects other kind of netdevices too.
>> > > Unfortunately I'm not sure that I will have enough time to handle it
>> > properly
>> > > and decided to publish current patch version as is.
>> > > OpenVz workaround it by using per-container limit for number of
>> > > available netdevices, but upstream does not have any kind of
>> > > per-container configuration.
>> > > ------
>> 
>> > Should this just be a new ucount limit on kernel/ucount.c and have veth
>> > use something like inc_ucount(current_user_ns(), current_euid(),
>> > UCOUNT_VETH)?
>> 
>> > This might be abusing ucounts though, not sure, Eric?
>> 
>> 
>> For admins of systems running multiple workloads, there is no easy way
>> to set such limits for each workload.
>
> That's why defaults would exist. Today's ulimits IMHO are insane and
> some are arbitrarily large.

My perspective is that we have two basic kinds of limits.

Limits to catch programs that go out of control hopefully before they
bring down the entire system.  This is the purpose I see of rlimits and
ucounts.  Such limits should be set by default so large that no one has
to care unless their program is broken.

Limits to contain programs and keep them from having a negative impact
on other programs.  Generally this is the role I see the cgroups
playing.  This limits must be much more tightly managed.

The problem with veth that was reported was that the memory cgroup
limits fails to contain veth's allocations and veth manages to affect
process outside the memory cgroup where the veth ``lives''.  The effect
is an OOM but the problem is that it is affecting processes out of the
memory control group.

Part of the reason for the recent ucount work is so that ordinary users
can create user namespaces and root in that user namespace won't be able
to exceed the limits that were set when the user namespace was created
by creating additional users.

Part of the reason for my ucount work is my frustration that cgroups
would up something completely different than what was originally
proposed and solve a rather problem set.  Originally the proposal was
that cgroups would be the user interface for the bean-counter patches.
(Roughly counts like the ucounts are now).  Except for maybe the pid
controller you mention below cgroups look nothing like that today.
So I went and I solved the original problem because it was still not
solved.

The network stack should already have packet limits to prevent a global
OOM so I am a bit curious why those limits aren't preventing a global
OOM in for the veth device.

I am not saying that the patch is correct (although from 10,000 feet the
patch sounds like it is solving the reported problem).  I am answering
the question of how I understand limits to work.

Luis does this explanation of how limits work help?

>> From admin's perspective it is preferred to have minimal
>> knobs to set and if these objects are charged to memcg then the memcg
>> limits would limit them. There was similar situation for inotify
>> instances where fs sysctl inotify/max_user_instances already limits the
>> inotify instances but we memcg charged them to not worry about setting
>> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
>> instances to kmemcg").
>
> Yes but we want sensible defaults out of the box. What those should be
> IMHO might be work which needs to be figured out well.
>
> IMHO today's ulimits are a bit over the top today. This is off slightly
> off topic but for instance play with:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make -j 8
> echo 0 > /proc/sys/vm/oom_dump_tasks                                            
> i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000;  sleep 1; let i=$i+1; done
>
> If you see:
>
> [  217.798124] cgroup: fork rejected by pids controller in
> /user.slice/user-1000.slice/session-1.scope
>                                                                                 
> Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
>
> [Slice]                                                                         
> TasksMax=MAX_TASKS|infinity
>
> Even though we have max_threads set to 61343, things ulimits have a
> different limit set, and what this means is the above can end up easily
> creating over 1048576 (17 times max_threads) threads all eagerly doing
> nothing to just exit, essentially allowing a sort of fork bomb on exit.
> Your system may or not fall to its knees.

What max_threads are you talking about here?  The global max_threads
exposed in /proc/sys/kernel/threads-max?  I don't see how you can get
around that.  Especially since the count is not decremented until the
process is reaped.

Or is this the pids controller having a low limit and
/proc/sys/kernel/threads-max having a higher limit?

I really have not looked at this pids controller.

So I am not certain I understand your example here but I hope I have
answered your question.

Eric

Luis Chamberlain March 1, 2022, 9:25 p.m. UTC | #5

On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
> 
> > On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> >> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> >> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> >> > > Following one-liner running inside memcg-limited container consumes
> >> > > huge number of host memory and can trigger global OOM.
> >> > >
> >> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >> > >
> >> > > Patch accounts most part of these allocations and can protect host.
> >> > > ---[cut]---
> >> > > It is not polished, and perhaps should be splitted.
> >> > > obviously it affects other kind of netdevices too.
> >> > > Unfortunately I'm not sure that I will have enough time to handle it
> >> > properly
> >> > > and decided to publish current patch version as is.
> >> > > OpenVz workaround it by using per-container limit for number of
> >> > > available netdevices, but upstream does not have any kind of
> >> > > per-container configuration.
> >> > > ------
> >> 
> >> > Should this just be a new ucount limit on kernel/ucount.c and have veth
> >> > use something like inc_ucount(current_user_ns(), current_euid(),
> >> > UCOUNT_VETH)?
> >> 
> >> > This might be abusing ucounts though, not sure, Eric?
> >> 
> >> 
> >> For admins of systems running multiple workloads, there is no easy way
> >> to set such limits for each workload.
> >
> > That's why defaults would exist. Today's ulimits IMHO are insane and
> > some are arbitrarily large.
> 
> My perspective is that we have two basic kinds of limits.
> 
> Limits to catch programs that go out of control hopefully before they
> bring down the entire system.  This is the purpose I see of rlimits and
> ucounts.  Such limits should be set by default so large that no one has
> to care unless their program is broken.
> 
> Limits to contain programs and keep them from having a negative impact
> on other programs.  Generally this is the role I see the cgroups
> playing.  This limits must be much more tightly managed.
> 
> The problem with veth that was reported was that the memory cgroup
> limits fails to contain veth's allocations and veth manages to affect
> process outside the memory cgroup where the veth ``lives''.  The effect
> is an OOM but the problem is that it is affecting processes out of the
> memory control group.

Given no upper bound was used in the commit log, it seems to have
presented a use case where both types of limits might need to be
considered though.

> Part of the reason for the recent ucount work is so that ordinary users
> can create user namespaces and root in that user namespace won't be able
> to exceed the limits that were set when the user namespace was created
> by creating additional users.

Got it.

> Part of the reason for my ucount work is my frustration that cgroups
> would up something completely different than what was originally
> proposed and solve a rather problem set.  Originally the proposal was
> that cgroups would be the user interface for the bean-counter patches.
> (Roughly counts like the ucounts are now).  Except for maybe the pid
> controller you mention below cgroups look nothing like that today.
> So I went and I solved the original problem because it was still not
> solved.

I see...

> The network stack should already have packet limits to prevent a global
> OOM so I am a bit curious why those limits aren't preventing a global
> OOM in for the veth device.

No packets are used in the demo / commit log, it is just creating
tons of veths that OOMs.

> I am not saying that the patch is correct (although from 10,000 feet the
> patch sounds like it is solving the reported problem).

From your description, it would like it is indeed a right approach
to correct memory allocations so that cgroup memory limits are respected.

Outside of that, it still begs the question if the ucounts can/should
be used for something like root in a namespace creating tons of veths
and put a cap to that.

> I am answering
> the question of how I understand limits to work.

It does!

> Luis does this explanation of how limits work help?

Yup thanks!

> >> From admin's perspective it is preferred to have minimal
> >> knobs to set and if these objects are charged to memcg then the memcg
> >> limits would limit them. There was similar situation for inotify
> >> instances where fs sysctl inotify/max_user_instances already limits the
> >> inotify instances but we memcg charged them to not worry about setting
> >> such limits. See ac7b79fd190b ("inotify, memcg: account inotify
> >> instances to kmemcg").
> >
> > Yes but we want sensible defaults out of the box. What those should be
> > IMHO might be work which needs to be figured out well.
> >
> > IMHO today's ulimits are a bit over the top today. This is off slightly
> > off topic but for instance play with:
> >
> > git clone https://github.com/ColinIanKing/stress-ng
> > cd stress-ng
> > make -j 8
> > echo 0 > /proc/sys/vm/oom_dump_tasks                                            
> > i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 --unshare-ops 10000;  sleep 1; let i=$i+1; done
> >
> > If you see:
> >
> > [  217.798124] cgroup: fork rejected by pids controller in
> > /user.slice/user-1000.slice/session-1.scope
> >                                                                                 
> > Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
> >
> > [Slice]                                                                         
> > TasksMax=MAX_TASKS|infinity
> >
> > Even though we have max_threads set to 61343, things ulimits have a
> > different limit set, and what this means is the above can end up easily
> > creating over 1048576 (17 times max_threads) threads all eagerly doing
> > nothing to just exit, essentially allowing a sort of fork bomb on exit.
> > Your system may or not fall to its knees.
> 
> What max_threads are you talking about here?

Sorry for not being clear, in the kernel this is exposed as max_threads

Which is initialize do kernel/fork.c

root@linus-blktests-block ~ # cat /proc/sys/kernel/threads-max
62157

> The global max_threads
> exposed in /proc/sys/kernel/threads-max?  I don't see how you can get
> around that. 

Yeah I was perplexed and I don't think it's just me.

> Especially since the count is not decremented until the
> process is reaped.
>
> Or is this the pids controller having a low limit and
> /proc/sys/kernel/threads-max having a higher limit?

Not sure, I used defailt debian testing with the above change
to /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to
TasksMax=MAX_TASKS|infinity

> I really have not looked at this pids controller.
> 
> So I am not certain I understand your example here but I hope I have
> answered your question.

During experimentation with the above stress-ng test case, I saw tons
of thread just waiting to do exit:

diff --git a/kernel/exit.c b/kernel/exit.c
index 80c4a67d2770..653ca7ebfb58 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -730,11 +730,24 @@ static void check_stack_usage(void)
 static inline void check_stack_usage(void) {}
 #endif
 
+/* Approx more than twice max_threads */
+#define MAX_EXIT_CONCURRENT (1<<17)
+static atomic_t exit_concurrent_max = ATOMIC_INIT(MAX_EXIT_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(exit_wq);
+
 void __noreturn do_exit(long code)
 {
 	struct task_struct *tsk = current;
 	int group_dead;
 
+	if (atomic_dec_if_positive(&exit_concurrent_max) < 0) {
+		pr_warn_ratelimited("exit: exit_concurrent_max (%u) close to 0 (max : %u), throttling...",
+				    atomic_read(&exit_concurrent_max),
+				    MAX_EXIT_CONCURRENT);
+		wait_event(exit_wq,
+			   atomic_dec_if_positive(&exit_concurrent_max) >= 0);
+	}
+
 	/*
 	 * We can get here from a kernel oops, sometimes with preemption off.
 	 * Start by checking for critical errors.
@@ -881,6 +894,9 @@ void __noreturn do_exit(long code)
 
 	lockdep_free_task(tsk);
 	do_task_dead();
+
+	atomic_inc(&exit_concurrent_max);
+	wake_up(&exit_wq);
 }
 EXPORT_SYMBOL_GPL(do_exit);
 
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 4f5613dac227..980ffaba1ac5 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
 		long max;
 		tns = iter->ns;
 		max = READ_ONCE(tns->ucount_max[type]);
+		if (atomic_long_read(&iter->ucount[type]) > max/16)
+			cond_resched();
 		if (!atomic_long_inc_below(&iter->ucount[type], max))
 			goto fail;
 	}

Luis Chamberlain March 1, 2022, 9:31 p.m. UTC | #6

On Tue, Mar 01, 2022 at 01:25:16PM -0800, Luis Chamberlain wrote:
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 4f5613dac227..980ffaba1ac5 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>  		long max;
>  		tns = iter->ns;
>  		max = READ_ONCE(tns->ucount_max[type]);
> +		if (atomic_long_read(&iter->ucount[type]) > max/16)
> +			cond_resched();
>  		if (!atomic_long_inc_below(&iter->ucount[type], max))
>  			goto fail;

You can of course ignore this, it was just a hack to try to avoid
a soft lockup on the workqueues.

  Luis

King, Colin March 2, 2022, 1:30 p.m. UTC | #7

Just to note that stress-ng does attempt to set the maximum ulimits across all ulimits before it invokes a stressor to try and stress the system as much as possible. It also changes the per-stressor oom adjust setting to make stressors less OOMable, one can disable this with --no-oom-adjust and one can force stressors from being restarted on an OOM with the --oomable option.

-----Original Message-----
From: Eric W. Biederman <ebiederm@xmission.com> 
Sent: 01 March 2022 20:50
To: Luis Chamberlain <mcgrof@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>; Colin Ian King <colin.king@canonical.com>; NeilBrown <neilb@suse.de>; Vasily Averin <vvs@virtuozzo.com>; Vlastimil Babka <vbabka@suse.cz>; Hocko, Michal <mhocko@suse.com>; Roman Gushchin <roman.gushchin@linux.dev>; Linux MM <linux-mm@kvack.org>; netdev@vger.kernel.org; David S. Miller <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>; Tejun Heo <tj@kernel.org>; Greg Kroah-Hartman <gregkh@linuxfoundation.org>; Eric Dumazet <edumazet@google.com>; Kees Cook <keescook@chromium.org>; Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>; David Ahern <dsahern@kernel.org>; linux-kernel@vger.kernel.org; kernel@openvz.org
Subject: Re: [PATCH RFC] net: memcg accounting for veth devices

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
>> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > > Following one-liner running inside memcg-limited container 
>> > > consumes huge number of host memory and can trigger global OOM.
>> > >
>> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; 
>> > > done
>> > >
>> > > Patch accounts most part of these allocations and can protect host.
>> > > ---[cut]---
>> > > It is not polished, and perhaps should be splitted.
>> > > obviously it affects other kind of netdevices too.
>> > > Unfortunately I'm not sure that I will have enough time to handle 
>> > > it
>> > properly
>> > > and decided to publish current patch version as is.
>> > > OpenVz workaround it by using per-container limit for number of 
>> > > available netdevices, but upstream does not have any kind of 
>> > > per-container configuration.
>> > > ------
>> 
>> > Should this just be a new ucount limit on kernel/ucount.c and have 
>> > veth use something like inc_ucount(current_user_ns(), 
>> > current_euid(), UCOUNT_VETH)?
>> 
>> > This might be abusing ucounts though, not sure, Eric?
>> 
>> 
>> For admins of systems running multiple workloads, there is no easy 
>> way to set such limits for each workload.
>
> That's why defaults would exist. Today's ulimits IMHO are insane and 
> some are arbitrarily large.

My perspective is that we have two basic kinds of limits.

Limits to catch programs that go out of control hopefully before they bring down the entire system.  This is the purpose I see of rlimits and ucounts.  Such limits should be set by default so large that no one has to care unless their program is broken.

Limits to contain programs and keep them from having a negative impact on other programs.  Generally this is the role I see the cgroups playing.  This limits must be much more tightly managed.

The problem with veth that was reported was that the memory cgroup limits fails to contain veth's allocations and veth manages to affect process outside the memory cgroup where the veth ``lives''.  The effect is an OOM but the problem is that it is affecting processes out of the memory control group.

Part of the reason for the recent ucount work is so that ordinary users can create user namespaces and root in that user namespace won't be able to exceed the limits that were set when the user namespace was created by creating additional users.

Part of the reason for my ucount work is my frustration that cgroups would up something completely different than what was originally proposed and solve a rather problem set.  Originally the proposal was that cgroups would be the user interface for the bean-counter patches.
(Roughly counts like the ucounts are now).  Except for maybe the pid controller you mention below cgroups look nothing like that today.
So I went and I solved the original problem because it was still not solved.

The network stack should already have packet limits to prevent a global OOM so I am a bit curious why those limits aren't preventing a global OOM in for the veth device.

I am not saying that the patch is correct (although from 10,000 feet the patch sounds like it is solving the reported problem).  I am answering the question of how I understand limits to work.

Luis does this explanation of how limits work help?

>> From admin's perspective it is preferred to have minimal knobs to set 
>> and if these objects are charged to memcg then the memcg limits would 
>> limit them. There was similar situation for inotify instances where 
>> fs sysctl inotify/max_user_instances already limits the inotify 
>> instances but we memcg charged them to not worry about setting such 
>> limits. See ac7b79fd190b ("inotify, memcg: account inotify instances 
>> to kmemcg").
>
> Yes but we want sensible defaults out of the box. What those should be 
> IMHO might be work which needs to be figured out well.
>
> IMHO today's ulimits are a bit over the top today. This is off 
> slightly off topic but for instance play with:
>
> git clone https://github.com/ColinIanKing/stress-ng
> cd stress-ng
> make -j 8
> echo 0 > /proc/sys/vm/oom_dump_tasks                                            
> i=1; while true; do echo "RUNNING TEST $i"; ./stress-ng --unshare 8192 
> --unshare-ops 10000;  sleep 1; let i=$i+1; done
>
> If you see:
>
> [  217.798124] cgroup: fork rejected by pids controller in 
> /user.slice/user-1000.slice/session-1.scope
>                                                                                 
> Edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf to be:
>
> [Slice]                                                                         
> TasksMax=MAX_TASKS|infinity
>
> Even though we have max_threads set to 61343, things ulimits have a 
> different limit set, and what this means is the above can end up 
> easily creating over 1048576 (17 times max_threads) threads all 
> eagerly doing nothing to just exit, essentially allowing a sort of fork bomb on exit.
> Your system may or not fall to its knees.

What max_threads are you talking about here?  The global max_threads exposed in /proc/sys/kernel/threads-max?  I don't see how you can get around that.  Especially since the count is not decremented until the process is reaped.

Or is this the pids controller having a low limit and /proc/sys/kernel/threads-max having a higher limit?

I really have not looked at this pids controller.

So I am not certain I understand your example here but I hope I have answered your question.

Eric

Eric W. Biederman March 2, 2022, 2:43 p.m. UTC | #8

Luis Chamberlain <mcgrof@kernel.org> writes:

> On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
>> I really have not looked at this pids controller.
>> 
>> So I am not certain I understand your example here but I hope I have
>> answered your question.
>
> During experimentation with the above stress-ng test case, I saw tons
> of thread just waiting to do exit:

You increment the count of concurrent threads after a no return function
in do_exit.  Since the increment is never reached the count always goes
down and eventually the warning prints.

> diff --git a/kernel/exit.c b/kernel/exit.c
> index 80c4a67d2770..653ca7ebfb58 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -730,11 +730,24 @@ static void check_stack_usage(void)
>  static inline void check_stack_usage(void) {}
>  #endif
>  
> +/* Approx more than twice max_threads */
> +#define MAX_EXIT_CONCURRENT (1<<17)
> +static atomic_t exit_concurrent_max = ATOMIC_INIT(MAX_EXIT_CONCURRENT);
> +static DECLARE_WAIT_QUEUE_HEAD(exit_wq);
> +
>  void __noreturn do_exit(long code)
>  {
>  	struct task_struct *tsk = current;
>  	int group_dead;
>  
> +	if (atomic_dec_if_positive(&exit_concurrent_max) < 0) {
> +		pr_warn_ratelimited("exit: exit_concurrent_max (%u) close to 0 (max : %u), throttling...",
> +				    atomic_read(&exit_concurrent_max),
> +				    MAX_EXIT_CONCURRENT);
> +		wait_event(exit_wq,
> +			   atomic_dec_if_positive(&exit_concurrent_max) >= 0);
> +	}
> +
>  	/*
>  	 * We can get here from a kernel oops, sometimes with preemption off.
>  	 * Start by checking for critical errors.
> @@ -881,6 +894,9 @@ void __noreturn do_exit(long code)
>  
>  	lockdep_free_task(tsk);
>  	do_task_dead();

The function do_task_dead never returns.

> +
> +	atomic_inc(&exit_concurrent_max);
> +	wake_up(&exit_wq);
>  }
>  EXPORT_SYMBOL_GPL(do_exit);
>  
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 4f5613dac227..980ffaba1ac5 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -238,6 +238,8 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>  		long max;
>  		tns = iter->ns;
>  		max = READ_ONCE(tns->ucount_max[type]);
> +		if (atomic_long_read(&iter->ucount[type]) > max/16)
> +			cond_resched();
>  		if (!atomic_long_inc_below(&iter->ucount[type], max))
>  			goto fail;
>  	}

Eric

Luis Chamberlain March 2, 2022, 9:52 p.m. UTC | #9

On Wed, Mar 02, 2022 at 08:43:54AM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <mcgrof@kernel.org> writes:
> 
> > On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> >> I really have not looked at this pids controller.
> >> 
> >> So I am not certain I understand your example here but I hope I have
> >> answered your question.
> >
> > During experimentation with the above stress-ng test case, I saw tons
> > of thread just waiting to do exit:
> 
> You increment the count of concurrent threads after a no return function
> in do_exit.  Since the increment is never reached the count always goes
> down and eventually the warning prints.
> 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 80c4a67d2770..653ca7ebfb58 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -881,6 +894,9 @@ void __noreturn do_exit(long code)
> >  
> >  	lockdep_free_task(tsk);
> >  	do_task_dead();
> 
> The function do_task_dead never returns.
> 
> > +
> > +	atomic_inc(&exit_concurrent_max);
> > +	wake_up(&exit_wq);
> >  }
> >  EXPORT_SYMBOL_GPL(do_exit);

Doh thanks!

  Luis

Vasily Averin April 11, 2022, 9:40 a.m. UTC | #10

On 3/1/22 21:09, Shakeel Butt wrote:
> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
>> On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
>> > Following one-liner running inside memcg-limited container consumes
>> > huge number of host memory and can trigger global OOM.
>> >
>> > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
>> >
>> > Patch accounts most part of these allocations and can protect host.
>> > ---[cut]---
>> > It is not polished, and perhaps should be splitted.
>> > obviously it affects other kind of netdevices too.
>> > Unfortunately I'm not sure that I will have enough time to handle it properly
>> > and decided to publish current patch version as is.
>> > OpenVz workaround it by using per-container limit for number of
>> > available netdevices, but upstream does not have any kind of
>> > per-container configuration.
>> > ------

I've noticed that __register_pernet_operations() executes init hook of registered 
pernet_operation structure in all found net namespaces.

Usually these hooks are called by process related to specified net namespace,
and all marked allocation are accounted to related container:
i.e. objects related to netns in container A are accounted to memcg of container A,
objects allocated inside container B are accounted to corresponding memcg B,
and so on.

However __register_pernet_operations() calls the same hooks in one context,
and as result all marked allocations are accounted to one memcg.
It is quite rare scenario, however current processing looks incorrect for me.

I expect we can take memcg from 'struct net', because of this structure is accounted per se.
then we can use set_active_memcg() before init hook execution.
However I'm not sure it is fully correct.

Could you please advise some better solution?

Thank you,
	Vasily Averin

[RFC] net: memcg accounting for veth devices

Checks

Commit Message

Comments

Patch