
fs, mm: account filp and names caches to kmemcg

Message ID 20171005222144.123797-1-shakeelb@google.com (mailing list archive)
State New, archived

Commit Message

Shakeel Butt Oct. 5, 2017, 10:21 p.m. UTC
The allocations from the filp and names kmem caches can be directly
triggered by user space applications. A buggy application can
consume a significant amount of unaccounted system memory. Though
we have not noticed such buggy applications in our production,
upon close inspection we found that a lot of machines spend a
very significant amount of memory on these caches. So, these
caches should be accounted to kmemcg.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 fs/dcache.c     | 2 +-
 fs/file_table.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Michal Hocko Oct. 6, 2017, 7:59 a.m. UTC | #1
On Thu 05-10-17 15:21:44, Shakeel Butt wrote:
> The allocations from the filp and names kmem caches can be directly
> triggered by user space applications. A buggy application can
> consume a significant amount of unaccounted system memory. Though
> we have not noticed such buggy applications in our production,
> upon close inspection we found that a lot of machines spend a
> very significant amount of memory on these caches. So, these
> caches should be accounted to kmemcg.
> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  fs/dcache.c     | 2 +-
>  fs/file_table.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index f90141387f01..fb3449161063 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
>  void __init vfs_caches_init(void)
>  {
>  	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> -			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);

I might be wrong but isn't name cache only holding temporary objects
used for path resolution which are not stored anywhere?
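
(For context, the pattern in question is roughly this, paraphrasing
include/linux/fs.h and fs/namei.c, so details may differ - the helper
do_the_lookup() is just a placeholder:

	/* PATH_MAX-sized buffers, allocated and freed within one syscall */
	#define __getname()		kmem_cache_alloc(names_cachep, GFP_KERNEL)
	#define __putname(name)	kmem_cache_free(names_cachep, (void *)(name))

	/* typical caller in a path-based syscall */
	struct filename *name = getname(pathname);	/* from names_cachep */
	if (IS_ERR(name))
		return PTR_ERR(name);
	error = do_the_lookup(name);
	putname(name);					/* gone before returning */

i.e. the object would never outlive the syscall, if I am reading it
right.)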

>  
>  	dcache_init();
>  	inode_init();
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 61517f57f8ef..567888cdf7d3 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -312,7 +312,7 @@ void put_filp(struct file *file)
>  void __init files_init(void)
>  {
>  	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> -			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> +			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>  	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>  }

Don't we have a limit for the maximum number of open files?
Shakeel Butt Oct. 6, 2017, 7:33 p.m. UTC | #2
>>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>
> I might be wrong but isn't name cache only holding temporary objects
> used for path resolution which are not stored anywhere?
>

Even though they're temporary, many containers can together use a
significant amount of transient uncharged memory. We've seen machines
with 100s of MiBs in names_cache.
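
(Illustrative of the shape we see, with invented numbers:

	$ grep names_cache /proc/slabinfo
	names_cache    65536  65536   4096    8    8 : tunables ...

at PATH_MAX (4 KiB) per object, 65536 objects is already 256 MiB that
no memory limit sees.)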

>>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>>  }
>
> Don't we have a limit for the maximum number of open files?
>

Yes, there is a system-wide limit on the maximum number of open files.
However, this limit is shared between different users on the system and
one user can hog this resource. To cater to that, we set the maximum
limit very high and let the memory limit of each user bound the number
of files they can open.
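
(To illustrate the setup, with made-up values:

	$ sysctl fs.file-max
	fs.file-max = 50000000
	$ echo 4G > /dev/cgroup/memory/job/memory.limit_in_bytes

i.e. fs.file-max is deliberately too high to ever be the effective
bound, and once filp is charged, a job's memory limit caps its open
files at roughly limit / sizeof(struct file) in the worst case.)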
Michal Hocko Oct. 9, 2017, 6:24 a.m. UTC | #3
On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >
> > I might be wrong but isn't name cache only holding temporary objects
> > used for path resolution which are not stored anywhere?
> >
> 
> Even though they're temporary, many containers can together use a
> significant amount of transient uncharged memory. We've seen machines
> with 100s of MiBs in names_cache.

Yes that might be possible but are we prepared for random ENOMEM from
vfs calls which need to allocate a temporary name?

> 
> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >>  }
> >
> > Don't we have a limit for the maximum number of open files?
> >
> 
> Yes, there is a system-wide limit on the maximum number of open files.
> However, this limit is shared between different users on the system and
> one user can hog this resource. To cater to that, we set the maximum
> limit very high and let the memory limit of each user bound the number
> of files they can open.

Similarly here. Are all syscalls allocating a fd prepared to return
ENOMEM?
Greg Thelen Oct. 9, 2017, 5:52 p.m. UTC | #4
Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>> 
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>
>> 
>> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>> 
>> Yes, there is a system-wide limit on the maximum number of open files.
>> However, this limit is shared between different users on the system and
>> one user can hog this resource. To cater to that, we set the maximum
>> limit very high and let the memory limit of each user bound the number
>> of files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?
>
> -- 
> Michal Hocko
> SUSE Labs

Even before this patch I find memcg oom handling inconsistent.  Page
cache pages trigger oom killer and may allow caller to succeed once the
kernel retries.  But kmem allocations don't call oom killer.  They
surface errors to user space.  This makes memcg hard to use for memory
overcommit because it's desirable for a high priority task to
transparently kill a lower priority task using the memcg oom killer.

A few ideas on how to make it more flexible:

a) Go back to memcg oom killing within memcg charging.  This runs risk
   of oom killing while caller holds locks which oom victim selection or
   oom victim termination may need.  Google's been running this way for
   a while.

b) Have every syscall return do something similar to page fault handler:
   kmem allocations in oom memcg mark the current task as needing an oom
   check and return NULL.  If marked oom, syscall exit would use
   mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
   risky.  I doubt every syscall is compatible with such a restart.

c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
   which will oom kill if needed.

Comments?

Demo program which eventually gets ENOSPC from mkdir.

$ cat /tmp/t
while umount /tmp/mnt; do true; done
mkdir -p /tmp/mnt
mount -t tmpfs nodev /tmp/mnt
cd /dev/cgroup/memory
rmdir t
mkdir t
echo 32M > t/memory.limit_in_bytes
(echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)

$ cat /tmp/mkdirs.c
#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

int main()
{
        int i;
        char name[32];

        if (mlockall(MCL_CURRENT|MCL_FUTURE))
                err(1, "mlockall");
        for (i = 0; i < (1<<20); i++) {
                sprintf(name, "%d", i);
                if (mkdir(name, 0700))
                        err(1, "mkdir");
        }
        printf("done\n");
        return 0;
}
Michal Hocko Oct. 9, 2017, 6:04 p.m. UTC | #5
[CC Johannes - the thread starts
http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]

On Mon 09-10-17 10:52:44, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >> >
> >> > I might be wrong but isn't name cache only holding temporary objects
> >> > used for path resolution which are not stored anywhere?
> >> >
> >> 
> >> Even though they're temporary, many containers can together use a
> >> significant amount of transient uncharged memory. We've seen machines
> >> with 100s of MiBs in names_cache.
> >
> > Yes that might be possible but are we prepared for random ENOMEM from
> > vfs calls which need to allocate a temporary name?
> >
> >> 
> >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >> >>  }
> >> >
> >> > Don't we have a limit for the maximum number of open files?
> >> >
> >> 
> >> Yes, there is a system-wide limit on the maximum number of open files.
> >> However, this limit is shared between different users on the system and
> >> one user can hog this resource. To cater to that, we set the maximum
> >> limit very high and let the memory limit of each user bound the number
> >> of files they can open.
> >
> > Similarly here. Are all syscalls allocating a fd prepared to return
> > ENOMEM?
> >
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> Even before this patch I find memcg oom handling inconsistent.  Page
> cache pages trigger oom killer and may allow caller to succeed once the
> kernel retries.  But kmem allocations don't call oom killer.  They
> surface errors to user space.  This makes memcg hard to use for memory
> overcommit because it's desirable for a high priority task to
> transparently kill a lower priority task using the memcg oom killer.
> 
> A few ideas on how to make it more flexible:
> 
> a) Go back to memcg oom killing within memcg charging.  This runs risk
>    of oom killing while caller holds locks which oom victim selection or
>    oom victim termination may need.  Google's been running this way for
>    a while.
> 
> b) Have every syscall return do something similar to page fault handler:
>    kmem allocations in oom memcg mark the current task as needing an oom
>    check and return NULL.  If marked oom, syscall exit would use
>    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
>    risky.  I doubt every syscall is compatible with such a restart.
> 
> c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>    which will oom kill if needed.
> 
> Comments?
> 
> Demo program which eventually gets ENOSPC from mkdir.
> 
> $ cat /tmp/t
> while umount /tmp/mnt; do true; done
> mkdir -p /tmp/mnt
> mount -t tmpfs nodev /tmp/mnt
> cd /dev/cgroup/memory
> rmdir t
> mkdir t
> echo 32M > t/memory.limit_in_bytes
> (echo $BASHPID > t/cgroup.procs && cd /tmp/mnt && exec /tmp/mkdirs)
> 
> $ cat /tmp/mkdirs.c
> #include <err.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> 
> int main()
> {
>         int i;
>         char name[32];
> 
>         if (mlockall(MCL_CURRENT|MCL_FUTURE))
>                 err(1, "mlockall");
>         for (i = 0; i < (1<<20); i++) {
>                 sprintf(name, "%d", i);
>                 if (mkdir(name, 0700))
>                         err(1, "mkdir");
>         }
>         printf("done\n");
>         return 0;
> }
Michal Hocko Oct. 9, 2017, 6:17 p.m. UTC | #6
On Mon 09-10-17 20:04:09, Michal Hocko wrote:
> [CC Johannes - the thread starts
> http://lkml.kernel.org/r/20171005222144.123797-1-shakeelb@google.com]
> 
> On Mon 09-10-17 10:52:44, Greg Thelen wrote:
[...]
> > A few ideas on how to make it more flexible:
> > 
> > a) Go back to memcg oom killing within memcg charging.  This runs risk
> >    of oom killing while caller holds locks which oom victim selection or
> >    oom victim termination may need.  Google's been running this way for
> >    a while.

We can actually reopen this discussion now that the oom handling is
async due to the oom_reaper. At least for the v2 interface. I would have
to think about it much more but the primary concern for this patch was
whether we really need/want to charge short-term objects which do not
outlive a single syscall.
 
> > b) Have every syscall return do something similar to page fault handler:
> >    kmem allocations in oom memcg mark the current task as needing an oom
> >    check and return NULL.  If marked oom, syscall exit would use
> >    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
> >    risky.  I doubt every syscall is compatible with such a restart.

yes, this is simply a no go

> > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> >    which will oom kill if needed.

This is what we have the max limit for.
Johannes Weiner Oct. 9, 2017, 8:26 p.m. UTC | #7
On Mon, Oct 09, 2017 at 10:52:44AM -0700, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> >> >
> >> > I might be wrong but isn't name cache only holding temporary objects
> >> > used for path resolution which are not stored anywhere?
> >> >
> >> 
> >> Even though they're temporary, many containers can together use a
> >> significant amount of transient uncharged memory. We've seen machines
> >> with 100s of MiBs in names_cache.
> >
> > Yes that might be possible but are we prepared for random ENOMEM from
> > vfs calls which need to allocate a temporary name?
> >
> >> 
> >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> >> >>  }
> >> >
> >> > Don't we have a limit for the maximum number of open files?
> >> >
> >> 
> >> Yes, there is a system-wide limit on the maximum number of open files.
> >> However, this limit is shared between different users on the system and
> >> one user can hog this resource. To cater to that, we set the maximum
> >> limit very high and let the memory limit of each user bound the number
> >> of files they can open.
> >
> > Similarly here. Are all syscalls allocating a fd prepared to return
> > ENOMEM?
> >
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> Even before this patch I find memcg oom handling inconsistent.  Page
> cache pages trigger oom killer and may allow caller to succeed once the
> kernel retries.  But kmem allocations don't call oom killer.

It's consistent in the sense that only page faults enable the memcg
OOM killer. It's not the type of memory that decides, it's whether the
allocation context has a channel to communicate an error to userspace.
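
The shape of that protocol, roughly (simplified from mm/memcontrol.c
and mm/oom_kill.c as of this series, so not exact code):

	/* try_charge(): a failed charge in page fault context only
	 * records the OOM state on the task and still fails the charge */
	mem_cgroup_oom(mem_over_limit, gfp_mask, order);
	return -ENOMEM;

	/* pagefault_out_of_memory(): at the end of the fault, with no
	 * locks held, it is safe to OOM kill and let the fault retry */
	if (mem_cgroup_oom_synchronize(true))
		return;

Syscall-context charges have no equivalent point at which to do the
killing.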

Whether userspace is able to handle -ENOMEM from syscalls was a voiced
concern at the time this patch was merged, although there haven't been
any reports so far, and it seemed like the lesser evil between that
and deadlocking the kernel.

If we could find a way to invoke the OOM killer safely, I would
welcome such patches.

> They surface errors to user space.  This makes memcg hard to use for
> memory overcommit because it's desirable for a high priority task to
> transparently kill a lower priority task using the memcg oom killer.
> 
> A few ideas on how to make it more flexible:
> 
> a) Go back to memcg oom killing within memcg charging.  This runs risk
>    of oom killing while caller holds locks which oom victim selection or
>    oom victim termination may need.  Google's been running this way for
>    a while.

We've had real-life reports of this breaking, so even if it works for
some people, I'd rather not revert to that way of doing things.

> b) Have every syscall return do something similar to page fault handler:
>    kmem allocations in oom memcg mark the current task as needing an oom
>    check and return NULL.  If marked oom, syscall exit would use
>    mem_cgroup_oom_synchronize() before retrying the syscall.  Seems
>    risky.  I doubt every syscall is compatible with such a restart.

That sounds like a lateral move.

> c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>    which will oom kill if needed.

This makes the most sense to me. Architecturally, I imagine this would
look like b), with an OOM handler at the point of return to userspace,
except that we'd overcharge instead of retrying the syscall.
Michal Hocko Oct. 10, 2017, 9:10 a.m. UTC | #8
On Mon 09-10-17 20:17:54, Michal Hocko wrote:
> the primary concern for this patch was whether we really need/want to
> charge short-term objects which do not outlive a single syscall.

Let me expand on this some more. What is the benefit of kmem accounting
of such an object? It cannot stop any runaway, as syscall-lifetime
allocations are bounded by the number of processes, which we already
contain by other means. If we do account them, then we apply memory
pressure due to something that cannot be reclaimed by any means. Even
the memcg OOM killer would simply kick in on a single path while there
might be others consuming the same type of memory.

So what is the actual point in accounting these? Does it help to contain
any workload better? What kind of workload?

Or am I completely wrong and name objects can outlive a syscall
considerably?
Michal Hocko Oct. 10, 2017, 9:14 a.m. UTC | #9
On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> On Mon, Oct 09, 2017 at 10:52:44AM -0700, Greg Thelen wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
> > >> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
> > >> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
> > >> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
> > >> >
> > >> > I might be wrong but isn't name cache only holding temporary objects
> > >> > used for path resolution which are not stored anywhere?
> > >> >
> > >> 
> > >> Even though they're temporary, many containers can together use a
> > >> significant amount of transient uncharged memory. We've seen machines
> > >> with 100s of MiBs in names_cache.
> > >
> > > Yes that might be possible but are we prepared for random ENOMEM from
> > > vfs calls which need to allocate a temporary name?
> > >
> > >> 
> > >> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> > >> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > >> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> > >> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> > >> >>  }
> > >> >
> > >> > Don't we have a limit for the maximum number of open files?
> > >> >
> > >> 
> > >> Yes, there is a system-wide limit on the maximum number of open files.
> > >> However, this limit is shared between different users on the system and
> > >> one user can hog this resource. To cater to that, we set the maximum
> > >> limit very high and let the memory limit of each user bound the number
> > >> of files they can open.
> > >
> > > Similarly here. Are all syscalls allocating a fd prepared to return
> > > ENOMEM?
> > >
> > > -- 
> > > Michal Hocko
> > > SUSE Labs
> > 
> > Even before this patch I find memcg oom handling inconsistent.  Page
> > cache pages trigger oom killer and may allow caller to succeed once the
> > kernel retries.  But kmem allocations don't call oom killer.
> 
> It's consistent in the sense that only page faults enable the memcg
> OOM killer. It's not the type of memory that decides, it's whether the
> allocation context has a channel to communicate an error to userspace.
> 
> Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> concern at the time this patch was merged, although there haven't been
> any reports so far,

Well, I remember reports about MAP_POPULATE breaking or at least having
an unexpected behavior.

> and it seemed like the lesser evil between that
> and deadlocking the kernel.

agreed on this part though

> If we could find a way to invoke the OOM killer safely, I would
> welcome such patches.

Well, we should be able to do that with the oom_reaper. At least for v2
which doesn't have synchronous userspace oom killing.

[...]

> > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> >    which will oom kill if needed.
> 
> This makes the most sense to me. Architecturally, I imagine this would
> look like b), with an OOM handler at the point of return to userspace,
> except that we'd overcharge instead of retrying the syscall.

I do not think we should break the hard limit semantic if possible. We
can currently allow that for allocations which are very short-term (oom
victims) or too important to fail, but allowing that for kmem charges in
general sounds like it would be too easy to run away.
Johannes Weiner Oct. 10, 2017, 2:17 p.m. UTC | #10
On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > It's consistent in the sense that only page faults enable the memcg
> > OOM killer. It's not the type of memory that decides, it's whether the
> > allocation context has a channel to communicate an error to userspace.
> > 
> > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > concern at the time this patch was merged, although there haven't been
> > any reports so far,
> 
> Well, I remember reports about MAP_POPULATE breaking or at least having
> an unexpected behavior.

Hm, that slipped past me. Did we do something about these? Or did they
fix userspace?

> Well, we should be able to do that with the oom_reaper. At least for v2
> which doesn't have synchronous userspace oom killing.

I don't see how the OOM reaper is a guarantee as long as we have this:

	if (!down_read_trylock(&mm->mmap_sem)) {
		ret = false;
		trace_skip_task_reaping(tsk->pid);
		goto unlock_oom;
	}

What do you mean by 'v2'?

> > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > >    which will oom kill if needed.
> > 
> > This makes the most sense to me. Architecturally, I imagine this would
> > look like b), with an OOM handler at the point of return to userspace,
> > except that we'd overcharge instead of retrying the syscall.
> 
> I do not think we should break the hard limit semantic if possible. We
> can currently allow that for allocations which are very short-term (oom
> victims) or too important to fail, but allowing that for kmem charges in
> general sounds like it would be too easy to run away.

I'm not sure there is a convenient way out of this.

If we want to respect the hard limit AND guarantee allocation success,
the OOM killer has to free memory reliably - which it doesn't. But if
it did, we could also break the limit temporarily and have the OOM
killer replenish the pool before that userspace app can continue. The
allocation wouldn't have to be short-lived, since memory is fungible.

Until the OOM killer is 100% reliable, we have the choice between
sometimes deadlocking the cgroup tasks and everything that interacts
with them, returning -ENOMEM for syscalls, or breaking the hard limit
guarantee during memcg OOM.

It seems breaking the limit temporarily in order to reclaim memory is
the best option. There is kernel memory we don't account to the memcg
already because we think it's probably not going to be significant, so
the isolation isn't 100% watertight in the first place. And I'd rather
have the worst-case effect of a cgroup OOMing be spilling over its
hard limit than deadlocking things inside and outside the cgroup.
Michal Hocko Oct. 10, 2017, 2:24 p.m. UTC | #11
On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > It's consistent in the sense that only page faults enable the memcg
> > > OOM killer. It's not the type of memory that decides, it's whether the
> > > allocation context has a channel to communicate an error to userspace.
> > > 
> > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > concern at the time this patch was merged, although there haven't been
> > > any reports so far,
> > 
> > Well, I remember reports about MAP_POPULATE breaking or at least having
> > an unexpected behavior.
> 
> Hm, that slipped past me. Did we do something about these? Or did they
> fix userspace?

Well, it was mostly LTP complaining. I have tried to fix that but Linus
was against it, so we just documented that this is possible and
MAP_POPULATE is not a guarantee.

> > Well, we should be able to do that with the oom_reaper. At least for v2
> > which doesn't have synchronous userspace oom killing.
> 
> I don't see how the OOM reaper is a guarantee as long as we have this:
> 
> 	if (!down_read_trylock(&mm->mmap_sem)) {
> 		ret = false;
> 		trace_skip_task_reaping(tsk->pid);
> 		goto unlock_oom;
> 	}

And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
killer if we fail to get the mmap_sem after several attempts. This will
allow us to find a new victim. So we shouldn't deadlock.

> What do you mean by 'v2'?

cgroup v2, because the legacy memcg allowed a sync wait for the oom
killer and that would be a bigger problem from deep callchains for
obvious reasons.

> > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > >    which will oom kill if needed.
> > > 
> > > This makes the most sense to me. Architecturally, I imagine this would
> > > look like b), with an OOM handler at the point of return to userspace,
> > > except that we'd overcharge instead of retrying the syscall.
> > 
> > I do not think we should break the hard limit semantic if possible. We
> > can currently allow that for allocations which are very short-term (oom
> > victims) or too important to fail, but allowing that for kmem charges in
> > general sounds like it would be too easy to run away.
> 
> I'm not sure there is a convenient way out of this.
> 
> If we want to respect the hard limit AND guarantee allocation success,
> the OOM killer has to free memory reliably - which it doesn't. But if
> it did, we could also break the limit temporarily and have the OOM
> killer replenish the pool before that userspace app can continue. The
> allocation wouldn't have to be short-lived, since memory is fungible.

If we can guarantee the oom killer is started then we can allow temporary
access to reserves, which is already implemented even for memcg. The
thing is we do not invoke the oom killer...
Shakeel Butt Oct. 10, 2017, 10:21 p.m. UTC | #12
On Sun, Oct 8, 2017 at 11:24 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 06-10-17 12:33:03, Shakeel Butt wrote:
>> >>       names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
>> >> -                     SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
>> >
>> > I might be wrong but isn't name cache only holding temporary objects
>> > used for path resolution which are not stored anywhere?
>> >
>>
>> Even though they're temporary, many containers can together use a
>> significant amount of transient uncharged memory. We've seen machines
>> with 100s of MiBs in names_cache.
>
> Yes that might be possible but are we prepared for random ENOMEM from
> vfs calls which need to allocate a temporary name?
>

I looked at all the syscalls which invoke allocations from
'names_cache' and tried to narrow down which of their man pages do not
mention that they can return ENOMEM. I found a couple of syscalls, like
truncate(), readdir() and getdents(), whose man pages do not mention
that they can return ENOMEM, but this patch will make that possible.

>>
>> >>       filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
>> >> -                     SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
>> >> +                     SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
>> >>       percpu_counter_init(&nr_files, 0, GFP_KERNEL);
>> >>  }
>> >
>> > Don't we have a limit for the maximum number of open files?
>> >
>>
>> Yes, there is a system-wide limit on the maximum number of open files.
>> However, this limit is shared between different users on the system and
>> one user can hog this resource. To cater to that, we set the maximum
>> limit very high and let the memory limit of each user bound the number
>> of files they can open.
>
> Similarly here. Are all syscalls allocating a fd prepared to return
> ENOMEM?

For filp, I found _sysctl(). However the man page says not to use it.

On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 09-10-17 20:17:54, Michal Hocko wrote:
>> the primary concern for this patch was whether we really need/want to
>> charge short-term objects which do not outlive a single syscall.
>
> Let me expand on this some more. What is the benefit of kmem accounting
> of such an object? It cannot stop any runaway, as syscall-lifetime
> allocations are bounded by the number of processes, which we already
> contain by other means.

We can contain this by limiting the number of processes or threads, but
for us applications having thousands of threads is very common. So,
limiting the number of threads/processes will not work.

> If we do account them, then we apply memory pressure due to
> something that cannot be reclaimed by any means. Even the memcg OOM
> killer would simply kick in on a single path while there might be
> others consuming the same type of memory.
>
> So what is the actual point in accounting these? Does it help to contain
> any workload better? What kind of workload?
>

I think the benefits will be isolation and more accurate billing. As I
have said before, we have observed 100s of MiBs in names_cache on many
machines, and the cumulative amount is not something we can ignore as
just memory overhead.

> Or am I completely wrong and name objects can outlive a syscall
> considerably?
>

No, I didn't find any instance of the name objects outliving the syscall.

Anyway, we can discuss names_cache more. Do you have any objection to
charging filp?
Al Viro Oct. 10, 2017, 11:32 p.m. UTC | #13
On Thu, Oct 05, 2017 at 03:21:44PM -0700, Shakeel Butt wrote:
> The allocations from the filp and names kmem caches can be directly
> triggered by user space applications. A buggy application can
> consume a significant amount of unaccounted system memory. Though
> we have not noticed such buggy applications in our production,
> upon close inspection we found that a lot of machines spend a
> very significant amount of memory on these caches. So, these
> caches should be accounted to kmemcg.

IDGI...  Surely, it's not hard to come up with a syscall that can
allocate a page for the duration of syscall?  Just to pick a random
example: reading from /proc/self/cmdline does that.  So does
readlink of /proc/self/cwd, etc.

What does accounting for such temporary allocations (with fixed
limit per syscall, always freed by the end of syscall) buy you,
why is it needed and what makes it not needed for the examples
above (and a slew of similar ones)?

While we are at it, how much overhead does it add on syscall-heavy
loads?  As in, a whole lot of threads is calling something like
stat("/", &stbuf); in parallel?  Because outside of that kind of
loads it's completely pointless...
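
Something as dumb as this would do for measuring it (a sketch; build
with -pthread and compare syscalls/sec with and without SLAB_ACCOUNT):

	#include <pthread.h>
	#include <stdlib.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* each worker hammers path resolution; every stat() does a
	 * getname()/putname() round trip through names_cachep */
	static void *worker(void *arg)
	{
		struct stat st;

		for (;;)
			stat("/", &st);
		return NULL;
	}

	int main(int argc, char **argv)
	{
		int i, n = argc > 1 ? atoi(argv[1]) : 64;
		pthread_t t;

		for (i = 0; i < n; i++)
			pthread_create(&t, NULL, worker, NULL);
		pause();	/* count the rate externally, e.g. with perf */
		return 0;
	}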
Michal Hocko Oct. 11, 2017, 9:09 a.m. UTC | #14
On Tue 10-10-17 15:21:53, Shakeel Butt wrote:
[...]
> On Tue, Oct 10, 2017 at 2:10 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 09-10-17 20:17:54, Michal Hocko wrote:
> >> the primary concern for this patch was whether we really need/want to
> >> charge short-term objects which do not outlive a single syscall.
> >
> > Let me expand on this some more. What is the benefit of kmem accounting
> > of such an object? It cannot stop any runaway, as syscall-lifetime
> > allocations are bounded by the number of processes, which we already
> > contain by other means.
> 
> We can contain this by limiting the number of processes or threads, but
> for us applications having thousands of threads is very common. So,
> limiting the number of threads/processes will not work.

Well, the number of tasks is already contained in a way because we do
account each task's (kernel) stack AFAIR.

> > If we do account them, then we apply memory pressure due to
> > something that cannot be reclaimed by any means. Even the memcg OOM
> > killer would simply kick in on a single path while there might be
> > others consuming the same type of memory.
> >
> > So what is the actual point in accounting these? Does it help to contain
> > any workload better? What kind of workload?
> >
> 
> I think the benefits will be isolation and more accurate billing. As I
> have said before, we have observed 100s of MiBs in names_cache on many
> machines, and the cumulative amount is not something we can ignore as
> just memory overhead.

I do agree with Al's argument that this is rather dubious and that it
can add overhead without a good reason.

> > Or am I completely wrong and name objects can outlive a syscall
> > considerably?
> >
> 
> No, I didn't find any instance of the name objects outliving the syscall.
> 
> Anyway, we can discuss names_cache more. Do you have any objection to
> charging filp?

I think filp makes more sense. But let's drop the names part for now. I
am not really convinced it is a good idea.
Johannes Weiner Oct. 12, 2017, 7:03 p.m. UTC | #15
On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > > It's consistent in the sense that only page faults enable the memcg
> > > > OOM killer. It's not the type of memory that decides, it's whether the
> > > > allocation context has a channel to communicate an error to userspace.
> > > > 
> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > > concern at the time this patch was merged, although there haven't been
> > > > any reports so far,
> > > 
> > > Well, I remember reports about MAP_POPULATE breaking or at least having
> > > an unexpected behavior.
> > 
> > Hm, that slipped past me. Did we do something about these? Or did they
> > fix userspace?
> 
> Well, it was mostly LTP complaining. I have tried to fix that but Linus
> was against it, so we just documented that this is possible and
> MAP_POPULATE is not a guarantee.

Okay, makes sense. I wouldn't really count that as a regression.

> > > Well, we should be able to do that with the oom_reaper. At least for v2
> > > which doesn't have synchronous userspace oom killing.
> > 
> > I don't see how the OOM reaper is a guarantee as long as we have this:
> > 
> > 	if (!down_read_trylock(&mm->mmap_sem)) {
> > 		ret = false;
> > 		trace_skip_task_reaping(tsk->pid);
> > 		goto unlock_oom;
> > 	}
> 
> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
> killer if we fail to get the mmap_sem after several attempts. This will
> allow us to find a new victim. So we shouldn't deadlock.

It's less likely to deadlock, but not exactly deadlock-free. There
might not BE any other mm's holding significant amounts of memory.

> > What do you mean by 'v2'?
> 
> cgroup v2, because the legacy memcg allowed a sync wait for the oom
> killer and that would be a bigger problem from deep callchains for
> obvious reasons.

Actually, the async oom killing code isn't dependent on cgroup
version. cgroup1 doesn't wait inside the charge context, either.

> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > > >    which will oom kill if needed.
> > > > 
> > > > This makes the most sense to me. Architecturally, I imagine this would
> > > > look like b), with an OOM handler at the point of return to userspace,
> > > > except that we'd overcharge instead of retrying the syscall.
> > > 
> > > I do not think we should break the hard limit semantic if possible. We
> > > can currently allow that for allocations which are very short-term (oom
> > > victims) or too important to fail, but allowing that for kmem charges in
> > > general sounds like it would be too easy to run away.
> > 
> > I'm not sure there is a convenient way out of this.
> > 
> > If we want to respect the hard limit AND guarantee allocation success,
> > the OOM killer has to free memory reliably - which it doesn't. But if
> > it did, we could also break the limit temporarily and have the OOM
> > killer replenish the pool before that userspace app can continue. The
> > allocation wouldn't have to be short-lived, since memory is fungible.
> 
> If we can guarantee the oom killer is started then we can allow temporary
> access to reserves, which is already implemented even for memcg. The
> thing is we do not invoke the oom killer...

You lost me here. Which reserves?

All I'm saying is that, when the syscall-context fails to charge, we
should do mem_cgroup_oom() to set up the async OOM killer, let the
charge succeed over the hard limit - since the OOM killer will most
likely get us back below the limit - then mem_cgroup_oom_synchronize()
before the syscall returns to userspace.

That would avoid returning -ENOMEM from syscalls without the risk of
the hard limit deadlocking - at the risk of sometimes overrunning the
hard limit, but that seems like the least problematic behavior out of
the three.
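
In pseudocode, something like this (a sketch of the proposal, not
existing code - the helpers named here exist, the flow is new):

	/* in try_charge(), once reclaim and retries have failed: */
	mem_cgroup_oom(memcg, gfp_mask, order);		/* arm the async OOM state */
	page_counter_charge(&memcg->memory, nr_pages);	/* overcharge past the limit */
	return 0;					/* no -ENOMEM to the syscall */

	/* on return to userspace, mirroring the page fault path: */
	if (unlikely(current->memcg_in_oom))
		mem_cgroup_oom_synchronize(true);	/* OOM kill, settle the limit */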
Greg Thelen Oct. 12, 2017, 11:57 p.m. UTC | #16
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
>> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
>> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
>> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
>> > > > It's consistent in the sense that only page faults enable the memcg
>> > > > OOM killer. It's not the type of memory that decides, it's whether the
>> > > > allocation context has a channel to communicate an error to userspace.
>> > > > 
>> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
>> > > > concern at the time this patch was merged, although there haven't been
>> > > > any reports so far,
>> > > 
>> > > Well, I remember reports about MAP_POPULATE breaking or at least having
>> > > an unexpected behavior.
>> > 
>> > Hm, that slipped past me. Did we do something about these? Or did they
>> > fix userspace?
>> 
>> Well, it was mostly LTP complaining. I have tried to fix that but Linus
>> was against it, so we just documented that this is possible and
>> MAP_POPULATE is not a guarantee.
>
> Okay, makes sense. I wouldn't really count that as a regression.
>
>> > > Well, we should be able to do that with the oom_reaper. At least for v2
>> > > which doesn't have synchronous userspace oom killing.
>> > 
>> > I don't see how the OOM reaper is a guarantee as long as we have this:
>> > 
>> > 	if (!down_read_trylock(&mm->mmap_sem)) {
>> > 		ret = false;
>> > 		trace_skip_task_reaping(tsk->pid);
>> > 		goto unlock_oom;
>> > 	}
>> 
>> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
>> killer if we fail to get the mmap_sem after several attempts. This will
>> allow us to find a new victim. So we shouldn't deadlock.
>
> It's less likely to deadlock, but not exactly deadlock-free. There
> might not BE any other mm's holding significant amounts of memory.
>
>> > What do you mean by 'v2'?
>> 
>> cgroup v2, because the legacy memcg allowed a sync wait for the oom
>> killer and that would be a bigger problem from deep callchains for
>> obvious reasons.
>
> Actually, the async oom killing code isn't dependent on cgroup
> version. cgroup1 doesn't wait inside the charge context, either.
>
>> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
>> > > > >    which will oom kill if needed.
>> > > > 
>> > > > This makes the most sense to me. Architecturally, I imagine this would
>> > > > look like b), with an OOM handler at the point of return to userspace,
>> > > > except that we'd overcharge instead of retrying the syscall.
>> > > 
>> > > I do not think we should break the hard limit semantic if possible. We
>> > > can currently allow that for allocations which are very short-term (oom
>> > > victims) or too important to fail, but allowing that for kmem charges in
>> > > general sounds like it would be too easy to run away.
>> > 
>> > I'm not sure there is a convenient way out of this.
>> > 
>> > If we want to respect the hard limit AND guarantee allocation success,
>> > the OOM killer has to free memory reliably - which it doesn't. But if
>> > it did, we could also break the limit temporarily and have the OOM
>> > killer replenish the pool before that userspace app can continue. The
>> > allocation wouldn't have to be short-lived, since memory is fungible.
>> 
>> If we can guarantee the oom killer is started then we can allow temporary
>> access to reserves, which is already implemented even for memcg. The
>> thing is we do not invoke the oom killer...
>
> You lost me here. Which reserves?
>
> All I'm saying is that, when the syscall-context fails to charge, we
> should do mem_cgroup_oom() to set up the async OOM killer, let the
> charge succeed over the hard limit - since the OOM killer will most
> likely get us back below the limit - then mem_cgroup_oom_synchronize()
> before the syscall returns to userspace.
>
> That would avoid returning -ENOMEM from syscalls without the risk of
> the hard limit deadlocking - at the risk of sometimes overrunning the
> hard limit, but that seems like the least problematic behavior out of
> the three.

Overcharging kmem with deferred reconciliation sounds good to me.

A few comments (not reasons to avoid this):

1) If a task is moved between memcg it seems possible to overcharge
   multiple oom memcg for different kmem/user allocations.
   mem_cgroup_oom_synchronize() would see at most one oom memcg in
   current->memcg_in_oom.  Thus it'd only reconcile a single memcg.  But
   that seems pretty rare and the next charge to any of the other memcg
   would reconcile them.

2) if a kernel thread charges kmem on behalf of a client mm then there
   is no good place to call mem_cgroup_oom_synchronize(), short of
   launching a work item in mem_cgroup_oom().  I don't think we have anything
   like that yet.  So nothing to worry about.

3) it's debatable if mem_cgroup_oom_synchronize() should first attempt
   reclaim before killing.  But that's a whole 'nother thread.

4) overcharging with deferred reconciliation could also be used for user
   pages.  But I haven't looked at the code long enough to know if this
   would be a net win.
Michal Hocko Oct. 13, 2017, 6:35 a.m. UTC | #17
On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
[...]
> > And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
> > killer if we fail to get the mmap_sem after several attempts. This will
> > allow us to find a new victim. So we shouldn't deadlock.
> 
> It's less likely to deadlock, but not exactly deadlock-free. There
> might not BE any other mm's holding significant amounts of memory.

True, try_charge would have to return with a failure when out_of_memory
returns false, of course.

> > > What do you mean by 'v2'?
> > 
> > cgroup v2 because the legacy memcg allowed sync wait for the oom killer
> > and that would be a bigger problem from a deep callchains for obevious
> > reasons.
> 
> Actually, the async oom killing code isn't dependent on cgroup
> version. cgroup1 doesn't wait inside the charge context, either.

Sorry, I was just not clear. What I meant to say is we couldn't make v1
wait inside the try_charge path because async oom killing wouldn't help
for the oom-disabled case (aka user space oom handling).

> > > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > > > >    which will oom kill if needed.
> > > > > 
> > > > > This makes the most sense to me. Architecturally, I imagine this would
> > > > > look like b), with an OOM handler at the point of return to userspace,
> > > > > except that we'd overcharge instead of retrying the syscall.
> > > > 
> > > > I do not think we should break the hard limit semantic if possible. We
> > > > can currently allow that for allocations which are very short-term (oom
> > > > victims) or too important to fail, but allowing that for kmem charges in
> > > > general sounds like it would be too easy to run away.
> > > 
> > > I'm not sure there is a convenient way out of this.
> > > 
> > > If we want to respect the hard limit AND guarantee allocation success,
> > > the OOM killer has to free memory reliably - which it doesn't. But if
> > > it did, we could also break the limit temporarily and have the OOM
> > > killer replenish the pool before that userspace app can continue. The
> > > allocation wouldn't have to be short-lived, since memory is fungible.
> > 
> > If we can guarantee the oom killer is started then we can allow temporary
> > access to reserves, which is already implemented even for memcg. The
> > thing is we do not invoke the oom killer...
> 
> You lost me here. Which reserves?
> 
> All I'm saying is that, when the syscall-context fails to charge, we
> should do mem_cgroup_oom() to set up the async OOM killer, let the
> charge succeed over the hard limit - since the OOM killer will most
> likely get us back below the limit - then mem_cgroup_oom_synchronize()
> before the syscall returns to userspace.

OK, then we are on the same page now. Your initial wording didn't
mention the async OOM killer. This makes more sense. Although I would
argue that we can retry the charge as long as out_of_memory finds a
victim. This would return ENOMEM only in the pathological cases where
no victims could be found.
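
Roughly (a sketch of that alternative; charge_one_attempt() is a
hypothetical stand-in for one pass of the existing charge logic):

	do {
		if (!charge_one_attempt(memcg, gfp_mask, nr_pages))
			return 0;		/* charge fit under the limit */
	} while (out_of_memory(&oc));		/* keep going while victims exist */
	return -ENOMEM;				/* pathological case: nobody left to kill */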
Michal Hocko Oct. 13, 2017, 6:51 a.m. UTC | #18
On Thu 12-10-17 16:57:22, Greg Thelen wrote:
[...]
> Overcharging kmem with deferred reconciliation sounds good to me.
> 
> A few comments (not reasons to avoid this):
> 
> 1) If a task is moved between memcg it seems possible to overcharge
>    multiple oom memcg for different kmem/user allocations.
>    mem_cgroup_oom_synchronize() would see at most one oom memcg in
>    current->memcg_in_oom.  Thus it'd only reconcile a single memcg.  But
>    that seems pretty rare and the next charge to any of the other memcg
>    would reconcile them.

This is a general problem for the cgroup v2 memcg oom handling. 

> 2) if a kernel thread charges kmem on behalf of a client mm then there
>    is no good place to call mem_cgroup_oom_synchronize(), short of
>    launching a work item in mem_cgroup_oom().  I don't think we have anything
>    like that yet.  So nothing to worry about.

If we do invoke the OOM killer from the charge path, it shouldn't be a
problem.

> 3) it's debatable if mem_cgroup_oom_synchronize() should first attempt
>    reclaim before killing.  But that's a whole 'nother thread.

Again, this shouldn't be an issue if we invoke the oom killer from the
charge path.

> 4) overcharging with deferred reconciliation could also be used for user
>    pages.  But I haven't looked at the code long enough to know if this
>    would be a net win.

It would solve g-u-p issues failing with ENOMEM unexpectedly just
because of memcg charge failure.
Johannes Weiner Oct. 24, 2017, 3:45 p.m. UTC | #19
On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote:
> On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> > All I'm saying is that, when the syscall-context fails to charge, we
> > should do mem_cgroup_oom() to set up the async OOM killer, let the
> > charge succeed over the hard limit - since the OOM killer will most
> > likely get us back below the limit - then mem_cgroup_oom_synchronize()
> > before the syscall returns to userspace.
> 
> OK, then we are on the same page now. Your initial wording didn't
> mention the async OOM killer. This makes more sense. Although I would
> argue that we can retry the charge as long as out_of_memory finds a
> victim. This would return ENOMEM only in the pathological cases where
> no victims could be found.

I think that's much worse because it's even harder to test and verify
your applications against.

If syscalls can return -ENOMEM on OOM, they should do so reliably.
Michal Hocko Oct. 24, 2017, 4:30 p.m. UTC | #20
On Tue 24-10-17 11:45:11, Johannes Weiner wrote:
> On Fri, Oct 13, 2017 at 08:35:55AM +0200, Michal Hocko wrote:
> > On Thu 12-10-17 15:03:12, Johannes Weiner wrote:
> > > All I'm saying is that, when the syscall-context fails to charge, we
> > > should do mem_cgroup_oom() to set up the async OOM killer, let the
> > > charge succeed over the hard limit - since the OOM killer will most
> > > likely get us back below the limit - then mem_cgroup_oom_synchronize()
> > > before the syscall returns to userspace.
> > 
> > OK, then we are on the same page now. Your initial wording didn't
> > mention the async OOM killer. This makes more sense. Although I would
> > argue that we can retry the charge as long as out_of_memory finds a
> > victim. This would return ENOMEM only in the pathological cases where
> > no victims could be found.
> 
> I think that's much worse because it's even harder to test and verify
> your applications against.

Well, the main distinction from the global OOM killer is that we panic
when there is no eligible oom victim, which we cannot do in the memcg
context. So we have to bail out somehow, and I would be really careful
about allowing a runaway from the hard limit just because we are out of
all eligible tasks. Returning ENOMEM sounds like a safer option to me.

> If syscalls can return -ENOMEM on OOM, they should do so reliably.

The main problem is that we do not know which syscalls can return ENOMEM.

Patch

diff --git a/fs/dcache.c b/fs/dcache.c
index f90141387f01..fb3449161063 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3642,7 +3642,7 @@ void __init vfs_caches_init_early(void)
 void __init vfs_caches_init(void)
 {
 	names_cachep = kmem_cache_create("names_cache", PATH_MAX, 0,
-			SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	dcache_init();
 	inode_init();
diff --git a/fs/file_table.c b/fs/file_table.c
index 61517f57f8ef..567888cdf7d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -312,7 +312,7 @@ void put_filp(struct file *file)
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }