netfilter: account ebt_table_info to kmemcg

Message ID 20181229015524.222741-1-shakeelb@google.com (mailing list archive)
State New, archived
Series netfilter: account ebt_table_info to kmemcg

Commit Message

Shakeel Butt Dec. 29, 2018, 1:55 a.m. UTC
The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
memory is already accounted to kmemcg. Do the same for ebtables. The
syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
whole system from a restricted memcg, a potential DoS.

Reported-by: syzbot+7713f3aa67be76b1552c@syzkaller.appspotmail.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 net/bridge/netfilter/ebtables.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)
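
For context: GFP_KERNEL_ACCOUNT, which the patch passes to the
three-argument __vmalloc() of this era, is simply GFP_KERNEL plus the
kmemcg accounting flag (from include/linux/gfp.h):

	#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)

With __GFP_ACCOUNT set, the pages backing the allocation are charged to
the memory cgroup of the allocating task, so the tables count against
that memcg's hard limit instead of only against global memory.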

Comments

Michal Hocko Dec. 29, 2018, 7:33 a.m. UTC | #1
On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> memory is already accounted to kmemcg. Do the same for ebtables. The
> syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> whole system from a restricted memcg, a potential DoS.

What is the lifetime of these objects? Are they bound to any process?

> Reported-by: syzbot+7713f3aa67be76b1552c@syzkaller.appspotmail.com
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  net/bridge/netfilter/ebtables.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
> index 491828713e0b..5e55cef0cec3 100644
> --- a/net/bridge/netfilter/ebtables.c
> +++ b/net/bridge/netfilter/ebtables.c
> @@ -1137,14 +1137,16 @@ static int do_replace(struct net *net, const void __user *user,
>  	tmp.name[sizeof(tmp.name) - 1] = 0;
>  
>  	countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids;
> -	newinfo = vmalloc(sizeof(*newinfo) + countersize);
> +	newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT,
> +			    PAGE_KERNEL);
>  	if (!newinfo)
>  		return -ENOMEM;
>  
>  	if (countersize)
>  		memset(newinfo->counters, 0, countersize);
>  
> -	newinfo->entries = vmalloc(tmp.entries_size);
> +	newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT,
> +				     PAGE_KERNEL);
>  	if (!newinfo->entries) {
>  		ret = -ENOMEM;
>  		goto free_newinfo;
> -- 
> 2.20.1.415.g653613c723-goog
>
Florian Westphal Dec. 29, 2018, 9:52 a.m. UTC | #2
Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > memory is already accounted to kmemcg. Do the same for ebtables. The
> > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > whole system from a restricted memcg, a potential DoS.
> 
> What is the lifetime of these objects? Are they bound to any process?

No, they are not.
They are free'd only when userspace requests it or the netns is
destroyed.
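
Concretely, the netns part of that lifetime is handled by each table
module's pernet exit hook. A schematic sketch of the pattern, with names
approximated from ~4.20 net/bridge/netfilter/ebtable_filter.c (exact
signatures vary across kernel versions):

	static void __net_exit frame_filter_net_exit(struct net *net)
	{
		/* Tears down this namespace's "filter" table, freeing
		 * the ebt_table_info, its counters and its entries. */
		ebt_unregister_table(net, net->xt.frame_filter, ebt_ops_filter);
	}

	static struct pernet_operations frame_filter_net_ops = {
		.init = frame_filter_net_init,
		.exit = frame_filter_net_exit,
	};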
Kirill Tkhai Dec. 29, 2018, 9:52 a.m. UTC | #3
Hi, Michal!

On 29.12.2018 10:33, Michal Hocko wrote:
> On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
>> The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
>> memory is already accounted to kmemcg. Do the same for ebtables. The
>> syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
>> whole system from a restricted memcg, a potential DoS.
> 
> What is the lifetime of these objects? Are they bound to any process?

These are the lists of ebtables rules, which can be displayed with the
ebtables-save command. If we do not account them, a low-priority
container may eat all the memory, and the OOM killer in berserk mode
will kill all the processes on the machine. They are not bound to any
process, but they are bound to a network namespace.

The OOM killer does not analyze such memory-cgroup-related allocations,
since it is task-aware only. Maybe we should make it namespace-aware
too...

Kirill

>> Reported-by: syzbot+7713f3aa67be76b1552c@syzkaller.appspotmail.com
>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
>> ---
>>  net/bridge/netfilter/ebtables.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
>> index 491828713e0b..5e55cef0cec3 100644
>> --- a/net/bridge/netfilter/ebtables.c
>> +++ b/net/bridge/netfilter/ebtables.c
>> @@ -1137,14 +1137,16 @@ static int do_replace(struct net *net, const void __user *user,
>>  	tmp.name[sizeof(tmp.name) - 1] = 0;
>>  
>>  	countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids;
>> -	newinfo = vmalloc(sizeof(*newinfo) + countersize);
>> +	newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT,
>> +			    PAGE_KERNEL);
>>  	if (!newinfo)
>>  		return -ENOMEM;
>>  
>>  	if (countersize)
>>  		memset(newinfo->counters, 0, countersize);
>>  
>> -	newinfo->entries = vmalloc(tmp.entries_size);
>> +	newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT,
>> +				     PAGE_KERNEL);
>>  	if (!newinfo->entries) {
>>  		ret = -ENOMEM;
>>  		goto free_newinfo;
>> -- 
>> 2.20.1.415.g653613c723-goog
>>
>
Michal Hocko Dec. 29, 2018, 10:06 a.m. UTC | #4
On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > whole system from a restricted memcg, a potential DoS.
> > 
> > What is the lifetime of these objects? Are they bound to any process?
> 
> No, they are not.
> They are free'd only when userspace requests it or the netns is
> destroyed.

Then this is problematic, because the oom killer is not able to
guarantee the hard limit, and so the excessive memory consumption cannot
really be contained. As a result the memcg will be basically useless
until somebody tears down the charged objects by other means. The memcg
oom killer will surely kill all the existing tasks in the cgroup, and
this could reduce the problem somewhat. Maybe this is sufficient for
some use cases, but that should be properly analyzed and described in
the changelog.
Shakeel Butt Dec. 29, 2018, 7:34 p.m. UTC | #5
On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > whole system from a restricted memcg, a potential DoS.
> > >
> > > What is the lifetime of these objects? Are they bound to any process?
> >
> > No, they are not.
> > They are free'd only when userspace requests it or the netns is
> > destroyed.
>
> Then this is problematic, because the oom killer is not able to
> guarantee the hard limit and so the excessive memory consumption cannot
> be really contained. As a result the memcg will be basically useless
> until somebody tears down the charged objects by other means. The memcg
> oom killer will surely kill all the existing tasks in the cgroup and
> this could somehow reduce the problem. Maybe this is sufficient for
> some usecases but that should be properly analyzed and described in the
> changelog.
>

Can you explain why you think the memcg hard limit will not be
enforced? From what I understand, the memcg oom-killer will kill the
allocating processes, as you have mentioned. We do force charging under
very limited conditions, but here the memcg oom-killer will take care
of it.
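
By force charging I mean the bypass in mm/memcontrol.c's try_charge();
paraphrased from around v4.20 (the exact conditions vary by version):

	/* After reclaim and the memcg OOM killer have both run, a task
	 * that is dying or already an OOM victim is allowed to charge
	 * past the hard limit rather than loop forever: */
	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_MEMALLOC))
		goto force;
	/* ... reclaim and oom handling elided ... */
	force:
		/* Charge the counters regardless of the limit. */
		page_counter_charge(&memcg->memory, nr_pages);
		if (do_memsw_account())
			page_counter_charge(&memcg->memsw, nr_pages);
		css_get_many(&memcg->css, nr_pages);
		return 0;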

Anyway, the kernel is already charging the memory for
[ip,ip6,arp]_tables, and this patch adds the charging for ebtables.
Without this patch, as Kirill has described and syzbot has shown, a low
priority memcg can force a system-wide OOM.

Shakeel
Shakeel Butt Dec. 29, 2018, 7:39 p.m. UTC | #6
Hi Kirill,

On Sat, Dec 29, 2018 at 1:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> Hi, Michal!
>
> On 29.12.2018 10:33, Michal Hocko wrote:
> > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> >> The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> >> memory is already accounted to kmemcg. Do the same for ebtables. The
> >> syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> >> whole system from a restricted memcg, a potential DoS.
> >
> > What is the lifetime of these objects? Are they bound to any process?
>
> These are list of ebtables rules, which may be displayed with $ebtables-save command.
> In case of we do not account them, a low priority container may eat all the memory
> and OOM killer in berserk mode will kill all the processes on machine. They are not bound
> to any process, but they are bound to network namespace.
>
> OOM killer does not analyze such the memory cgroup-related allocations, since it
> is task-aware only. Maybe we should do it namespace-aware too...

This is a good idea. I am already brainstorming on a somewhat similar
idea to make shmem/tmpfs files oom-killable. I will share once I have
something more concrete, and will think about the namespace angle too.

thanks,
Shakeel
Michal Hocko Dec. 30, 2018, 7:45 a.m. UTC | #7
On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > whole system from a restricted memcg, a potential DoS.
> > > >
> > > > What is the lifetime of these objects? Are they bound to any process?
> > >
> > > No, they are not.
> > > They are free'd only when userspace requests it or the netns is
> > > destroyed.
> >
> > Then this is problematic, because the oom killer is not able to
> > guarantee the hard limit and so the excessive memory consumption cannot
> > be really contained. As a result the memcg will be basically useless
> > until somebody tears down the charged objects by other means. The memcg
> > oom killer will surely kill all the existing tasks in the cgroup and
> > this could somehow reduce the problem. Maybe this is sufficient for
> > some usecases but that should be properly analyzed and described in the
> > changelog.
> >
> 
> Can you explain why you think the memcg hard limit will not be
> enforced? From what I understand, the memcg oom-killer will kill the
> allocating processes as you have mentioned. We do force charging for
> very limited conditions but here the memcg oom-killer will take care
> of

I was talking about the force-charge part. Depending on the specific
allocation and its lifetime, this can theoretically get us over the hard
limit gradually, without any bound.

> Anyways, the kernel is already charging the memory for
> [ip,ip6,arp]_tables and this patch adds the charging for ebtables.
> Without this patch, as Kirill has described and shown by syzbot, a low
> priority memcg can force system OOM.

I am not opposing the patch per-se. I would just like the changelog to
be more descriptive about the life time and consequences.
Michal Hocko Dec. 30, 2018, 8 a.m. UTC | #8
On Sun 30-12-18 08:45:13, Michal Hocko wrote:
> On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > whole system from a restricted memcg, a potential DoS.
> > > > >
> > > > > What is the lifetime of these objects? Are they bound to any process?
> > > >
> > > > No, they are not.
> > > > They are free'd only when userspace requests it or the netns is
> > > > destroyed.
> > >
> > > Then this is problematic, because the oom killer is not able to
> > > guarantee the hard limit and so the excessive memory consumption cannot
> > > be really contained. As a result the memcg will be basically useless
> > > until somebody tears down the charged objects by other means. The memcg
> > > oom killer will surely kill all the existing tasks in the cgroup and
> > > this could somehow reduce the problem. Maybe this is sufficient for
> > > some usecases but that should be properly analyzed and described in the
> > > changelog.
> > >
> > 
> > Can you explain why you think the memcg hard limit will not be
> > enforced? From what I understand, the memcg oom-killer will kill the
> > allocating processes as you have mentioned. We do force charging for
> > very limited conditions but here the memcg oom-killer will take care
> > of
> 
> I was talking about the force charge part. Depending on a specific
> allocation and its life time this can gradually get us over hard limit
> without any bound theoretically.

Forgot to mention: since b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed"") there is no way to bail out of the
vmalloc allocation loop, so if the request is really large then the
memcg oom will not help. Is that a problem here?

Maybe it is time to revisit the fatal_signal_pending check.
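
For reference, the reverted check sat in the loop in mm/vmalloc.c's
__vmalloc_area_node() that populates the area page by page. Roughly
(reconstructed, not a verbatim quote of the old diff; the allocation
call is simplified):

	for (i = 0; i < area->nr_pages; i++) {
		struct page *page;

		/* The reverted bail-out: a task killed by the (memcg)
		 * OOM killer stops allocating instead of pushing on. */
		if (fatal_signal_pending(current)) {
			area->nr_pages = i;
			goto fail;
		}

		page = alloc_page(gfp_mask);
		if (unlikely(!page)) {
			area->nr_pages = i;
			goto fail;
		}
		area->pages[i] = page;
	}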
Shakeel Butt Dec. 31, 2018, 3:59 a.m. UTC | #9
On Sun, Dec 30, 2018 at 12:00 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 30-12-18 08:45:13, Michal Hocko wrote:
> > On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > > whole system from a restricted memcg, a potential DoS.
> > > > > >
> > > > > > What is the lifetime of these objects? Are they bound to any process?
> > > > >
> > > > > No, they are not.
> > > > > They are free'd only when userspace requests it or the netns is
> > > > > destroyed.
> > > >
> > > > Then this is problematic, because the oom killer is not able to
> > > > guarantee the hard limit and so the excessive memory consumption cannot
> > > > be really contained. As a result the memcg will be basically useless
> > > > until somebody tears down the charged objects by other means. The memcg
> > > > oom killer will surely kill all the existing tasks in the cgroup and
> > > > this could somehow reduce the problem. Maybe this is sufficient for
> > > > some usecases but that should be properly analyzed and described in the
> > > > changelog.
> > > >
> > >
> > > Can you explain why you think the memcg hard limit will not be
> > > enforced? From what I understand, the memcg oom-killer will kill the
> > > allocating processes as you have mentioned. We do force charging for
> > > very limited conditions but here the memcg oom-killer will take care
> > > of
> >
> > I was talking about the force charge part. Depending on a specific
> > allocation and its life time this can gradually get us over hard limit
> > without any bound theoretically.
>
> Forgot to mention. Since b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed"") there is no way to bail out from the
> vmalloc allocation loop so if the request is really large then the memcg
> oom will not help. Is that a problem here?
>

Yes, I think it will be an issue here.

> Maybe it is time to revisit fatal_signal_pending check.

Yes, we will need something to handle the memcg OOM. I will think more
on that front, or if you have any ideas, please do propose them.

thanks,
Shakeel
Shakeel Butt Dec. 31, 2018, 4 a.m. UTC | #10
On Sat, Dec 29, 2018 at 11:45 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > whole system from a restricted memcg, a potential DoS.
> > > > >
> > > > > What is the lifetime of these objects? Are they bound to any process?
> > > >
> > > > No, they are not.
> > > > They are free'd only when userspace requests it or the netns is
> > > > destroyed.
> > >
> > > Then this is problematic, because the oom killer is not able to
> > > guarantee the hard limit and so the excessive memory consumption cannot
> > > be really contained. As a result the memcg will be basically useless
> > > until somebody tears down the charged objects by other means. The memcg
> > > oom killer will surely kill all the existing tasks in the cgroup and
> > > this could somehow reduce the problem. Maybe this is sufficient for
> > > some usecases but that should be properly analyzed and described in the
> > > changelog.
> > >
> >
> > Can you explain why you think the memcg hard limit will not be
> > enforced? From what I understand, the memcg oom-killer will kill the
> > allocating processes as you have mentioned. We do force charging for
> > very limited conditions but here the memcg oom-killer will take care
> > of
>
> I was talking about the force charge part. Depending on a specific
> allocation and its life time this can gradually get us over hard limit
> without any bound theoretically.
>
> > Anyways, the kernel is already charging the memory for
> > [ip,ip6,arp]_tables and this patch adds the charging for ebtables.
> > Without this patch, as Kirill has described and shown by syzbot, a low
> > priority memcg can force system OOM.
>
> I am not opposing the patch per-se. I would just like the changelog to
> be more descriptive about the life time and consequences.
> --

I will resend the patch with a more detailed changelog.

thanks,
Shakeel
Michal Hocko Dec. 31, 2018, 10:11 a.m. UTC | #11
On Sun 30-12-18 19:59:53, Shakeel Butt wrote:
> On Sun, Dec 30, 2018 at 12:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 30-12-18 08:45:13, Michal Hocko wrote:
> > > On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > > > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > > > whole system from a restricted memcg, a potential DoS.
> > > > > > >
> > > > > > > What is the lifetime of these objects? Are they bound to any process?
> > > > > >
> > > > > > No, they are not.
> > > > > > They are free'd only when userspace requests it or the netns is
> > > > > > destroyed.
> > > > >
> > > > > Then this is problematic, because the oom killer is not able to
> > > > > guarantee the hard limit and so the excessive memory consumption cannot
> > > > > be really contained. As a result the memcg will be basically useless
> > > > > until somebody tears down the charged objects by other means. The memcg
> > > > > oom killer will surely kill all the existing tasks in the cgroup and
> > > > > this could somehow reduce the problem. Maybe this is sufficient for
> > > > > some usecases but that should be properly analyzed and described in the
> > > > > changelog.
> > > > >
> > > >
> > > > Can you explain why you think the memcg hard limit will not be
> > > > enforced? From what I understand, the memcg oom-killer will kill the
> > > > allocating processes as you have mentioned. We do force charging for
> > > > very limited conditions but here the memcg oom-killer will take care
> > > > of
> > >
> > > I was talking about the force charge part. Depending on a specific
> > > allocation and its life time this can gradually get us over hard limit
> > > without any bound theoretically.
> >
> > Forgot to mention. Since b8c8a338f75e ("Revert "vmalloc: back off when
> > the current task is killed"") there is no way to bail out from the
> > vmalloc allocation loop so if the request is really large then the memcg
> > oom will not help. Is that a problem here?
> >
> 
> Yes, I think it will be an issue here.
> 
> > Maybe it is time to revisit fatal_signal_pending check.
> 
> Yes, we will need something to handle the memcg OOM. I will think more
> on that front or if you have any ideas, please do propose.

I can see three options here:
	- do not force charge on memcg oom or introduce a limited charge
	  overflow (reserves basically).
	- revert the revert and reintroduce the fatal_signal_pending
	  check into vmalloc
	- be more specific and check tsk_is_oom_victim in vmalloc and
	  fail
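
For the third option, the check could sit at the same spot in the
vmalloc loop; tsk_is_oom_victim() comes from include/linux/oom.h, and
the placement here is only illustrative:

	/* Fail the allocation only for a task already selected as an
	 * OOM victim; ordinary fatal signals keep working as today. */
	if (tsk_is_oom_victim(current)) {
		area->nr_pages = i;
		goto fail;
	}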
Shakeel Butt Jan. 3, 2019, 8:52 p.m. UTC | #12
On Mon, Dec 31, 2018 at 2:12 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sun 30-12-18 19:59:53, Shakeel Butt wrote:
> > On Sun, Dec 30, 2018 at 12:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Sun 30-12-18 08:45:13, Michal Hocko wrote:
> > > > On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > > > > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > > > > whole system from a restricted memcg, a potential DoS.
> > > > > > > >
> > > > > > > > What is the lifetime of these objects? Are they bound to any process?
> > > > > > >
> > > > > > > No, they are not.
> > > > > > > They are free'd only when userspace requests it or the netns is
> > > > > > > destroyed.
> > > > > >
> > > > > > Then this is problematic, because the oom killer is not able to
> > > > > > guarantee the hard limit and so the excessive memory consumption cannot
> > > > > > be really contained. As a result the memcg will be basically useless
> > > > > > until somebody tears down the charged objects by other means. The memcg
> > > > > > oom killer will surely kill all the existing tasks in the cgroup and
> > > > > > this could somehow reduce the problem. Maybe this is sufficient for
> > > > > > some usecases but that should be properly analyzed and described in the
> > > > > > changelog.
> > > > > >
> > > > >
> > > > > Can you explain why you think the memcg hard limit will not be
> > > > > enforced? From what I understand, the memcg oom-killer will kill the
> > > > > allocating processes as you have mentioned. We do force charging for
> > > > > very limited conditions but here the memcg oom-killer will take care
> > > > > of
> > > >
> > > > I was talking about the force charge part. Depending on a specific
> > > > allocation and its life time this can gradually get us over hard limit
> > > > without any bound theoretically.
> > >
> > > Forgot to mention. Since b8c8a338f75e ("Revert "vmalloc: back off when
> > > the current task is killed"") there is no way to bail out from the
> > > vmalloc allocation loop so if the request is really large then the memcg
> > > oom will not help. Is that a problem here?
> > >
> >
> > Yes, I think it will be an issue here.
> >
> > > Maybe it is time to revisit fatal_signal_pending check.
> >
> > Yes, we will need something to handle the memcg OOM. I will think more
> > on that front or if you have any ideas, please do propose.
>
> I can see three options here:
>         - do not force charge on memcg oom or introduce a limited charge
>           overflow (reserves basically).
>         - revert the revert and reintroduce the fatal_signal_pending
>           check into vmalloc
>         - be more specific and check tsk_is_oom_victim in vmalloc and
>           fail
>

I think for the long-term solution we might need something similar to
memcg oom reserves (1), but as a quick fix I think we can do a
combination of (2) and (3).
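
Combining (2) and (3) would presumably mean guarding the loop with both
predicates, e.g. (illustrative only):

	if (fatal_signal_pending(current) || tsk_is_oom_victim(current)) {
		area->nr_pages = i;
		goto fail;
	}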

Shakeel
Michal Hocko Jan. 4, 2019, 1:21 p.m. UTC | #13
On Thu 03-01-19 12:52:54, Shakeel Butt wrote:
> On Mon, Dec 31, 2018 at 2:12 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Sun 30-12-18 19:59:53, Shakeel Butt wrote:
> > > On Sun, Dec 30, 2018 at 12:00 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Sun 30-12-18 08:45:13, Michal Hocko wrote:
> > > > > On Sat 29-12-18 11:34:29, Shakeel Butt wrote:
> > > > > > On Sat, Dec 29, 2018 at 2:06 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > >
> > > > > > > On Sat 29-12-18 10:52:15, Florian Westphal wrote:
> > > > > > > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > > > On Fri 28-12-18 17:55:24, Shakeel Butt wrote:
> > > > > > > > > > The [ip,ip6,arp]_tables use x_tables_info internally and the underlying
> > > > > > > > > > memory is already accounted to kmemcg. Do the same for ebtables. The
> > > > > > > > > > syzbot, by using setsockopt(EBT_SO_SET_ENTRIES), was able to OOM the
> > > > > > > > > > whole system from a restricted memcg, a potential DoS.
> > > > > > > > >
> > > > > > > > > What is the lifetime of these objects? Are they bound to any process?
> > > > > > > >
> > > > > > > > No, they are not.
> > > > > > > > They are free'd only when userspace requests it or the netns is
> > > > > > > > destroyed.
> > > > > > >
> > > > > > > Then this is problematic, because the oom killer is not able to
> > > > > > > guarantee the hard limit and so the excessive memory consumption cannot
> > > > > > > be really contained. As a result the memcg will be basically useless
> > > > > > > until somebody tears down the charged objects by other means. The memcg
> > > > > > > oom killer will surely kill all the existing tasks in the cgroup and
> > > > > > > this could somehow reduce the problem. Maybe this is sufficient for
> > > > > > > some usecases but that should be properly analyzed and described in the
> > > > > > > changelog.
> > > > > > >
> > > > > >
> > > > > > Can you explain why you think the memcg hard limit will not be
> > > > > > enforced? From what I understand, the memcg oom-killer will kill the
> > > > > > allocating processes as you have mentioned. We do force charging for
> > > > > > very limited conditions but here the memcg oom-killer will take care
> > > > > > of
> > > > >
> > > > > I was talking about the force charge part. Depending on a specific
> > > > > allocation and its life time this can gradually get us over hard limit
> > > > > without any bound theoretically.
> > > >
> > > > Forgot to mention. Since b8c8a338f75e ("Revert "vmalloc: back off when
> > > > the current task is killed"") there is no way to bail out from the
> > > > vmalloc allocation loop so if the request is really large then the memcg
> > > > oom will not help. Is that a problem here?
> > > >
> > >
> > > Yes, I think it will be an issue here.
> > >
> > > > Maybe it is time to revisit fatal_signal_pending check.
> > >
> > > Yes, we will need something to handle the memcg OOM. I will think more
> > > on that front or if you have any ideas, please do propose.
> >
> > I can see three options here:
> >         - do not force charge on memcg oom or introduce a limited charge
> >           overflow (reserves basically).
> >         - revert the revert and reintroduce the fatal_signal_pending
> >           check into vmalloc
> >         - be more specific and check tsk_is_oom_victim in vmalloc and
> >           fail
> >
> 
> I think for the long term solution we might need something similar to
> memcg oom reserves (1) but for quick fix I think we can do the
> combination of (2) and (3).

Johannes argued that fatal_signal_pending is too general a check for
vmalloc. I would argue that we already break out of some operations on
fatal signals. tsk_is_oom_victim, on the other hand, is more subtle but
much more targeted.

I do not have any strong preference, to be honest, but I agree that some
limited reserves would be the best long-term solution. I just do not
have any idea how to scale those reserves so that they are meaningful.

Patch

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 491828713e0b..5e55cef0cec3 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1137,14 +1137,16 @@ static int do_replace(struct net *net, const void __user *user,
 	tmp.name[sizeof(tmp.name) - 1] = 0;
 
 	countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids;
-	newinfo = vmalloc(sizeof(*newinfo) + countersize);
+	newinfo = __vmalloc(sizeof(*newinfo) + countersize, GFP_KERNEL_ACCOUNT,
+			    PAGE_KERNEL);
 	if (!newinfo)
 		return -ENOMEM;
 
 	if (countersize)
 		memset(newinfo->counters, 0, countersize);
 
-	newinfo->entries = vmalloc(tmp.entries_size);
+	newinfo->entries = __vmalloc(tmp.entries_size, GFP_KERNEL_ACCOUNT,
+				     PAGE_KERNEL);
 	if (!newinfo->entries) {
 		ret = -ENOMEM;
 		goto free_newinfo;