memcg: force charge kmem counter too
diff mbox

Message ID 20180525185501.82098-1-shakeelb@google.com
State New
Headers show

Commit Message

Shakeel Butt May 25, 2018, 6:55 p.m. UTC
Based on several conditions the kernel can decide to force charge an
allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
on kmem counter is set and reached.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 mm/memcontrol.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

Comments

Vladimir Davydov May 26, 2018, 6:51 p.m. UTC | #1
On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
> Based on several conditions the kernel can decide to force charge an
> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
> on kmem counter is set and reached.

memory.kmem.limit is broken and unlikely to ever be fixed as this knob
was deprecated in cgroup-v2. The fact that hitting the limit doesn't
trigger reclaim can result in unexpected behavior from user's pov, like
getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
allocations isn't going to fix those problem. So I'd suggest to avoid
setting memory.kmem.limit instead of trying to fix it or, even better,
switch to cgroup-v2.

> 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  mm/memcontrol.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab5673dbfc4e..0a88f824c550 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1893,6 +1893,18 @@ void mem_cgroup_handle_over_high(void)
>  	current->memcg_nr_pages_over_high = 0;
>  }
>  
> +/*
> + * Based on try_charge() force charge conditions.
> + */
> +static inline bool should_force_charge(gfp_t gfp_mask)
> +{
> +	return (unlikely(tsk_is_oom_victim(current) ||
> +			 fatal_signal_pending(current) ||
> +			 current->flags & PF_EXITING ||
> +			 current->flags & PF_MEMALLOC ||
> +			 gfp_mask & __GFP_NOFAIL));
> +}
> +
>  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		      unsigned int nr_pages)
>  {
> @@ -2008,6 +2020,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * The allocation either can't fail or will lead to more memory
>  	 * being freed very soon.  Allow memory usage go over the limit
>  	 * temporarily by force charging it.
> +	 *
> +	 * NOTE: Please keep the should_force_charge() conditions in sync.
>  	 */
>  	page_counter_charge(&memcg->memory, nr_pages);
>  	if (do_memsw_account())
> @@ -2331,8 +2345,11 @@ int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>  
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>  	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> -		cancel_charge(memcg, nr_pages);
> -		return -ENOMEM;
> +		if (!should_force_charge(gfp)) {
> +			cancel_charge(memcg, nr_pages);
> +			return -ENOMEM;
> +		}
> +		page_counter_charge(&memcg->kmem, nr_pages);
>  	}
>  
>  	page->mem_cgroup = memcg;
Shakeel Butt May 26, 2018, 10:37 p.m. UTC | #2
On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
<vdavydov.dev@gmail.com> wrote:
> On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
>> Based on several conditions the kernel can decide to force charge an
>> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
>> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
>> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
>> on kmem counter is set and reached.
>
> memory.kmem.limit is broken and unlikely to ever be fixed as this knob
> was deprecated in cgroup-v2. The fact that hitting the limit doesn't
> trigger reclaim can result in unexpected behavior from user's pov, like
> getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
> allocations isn't going to fix those problem.

I understand that fixing NOFAIL will not fix all other issues but it
still is better than current situation. IMHO we should keep fixing
kmem bit by bit.

One crazy idea is to just break it completely by force charging all the time.
Michal Hocko May 28, 2018, 9:11 a.m. UTC | #3
On Sat 26-05-18 15:37:05, Shakeel Butt wrote:
> On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
> <vdavydov.dev@gmail.com> wrote:
> > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
> >> Based on several conditions the kernel can decide to force charge an
> >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
> >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
> >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
> >> on kmem counter is set and reached.
> >
> > memory.kmem.limit is broken and unlikely to ever be fixed as this knob
> > was deprecated in cgroup-v2. The fact that hitting the limit doesn't
> > trigger reclaim can result in unexpected behavior from user's pov, like
> > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
> > allocations isn't going to fix those problem.
> 
> I understand that fixing NOFAIL will not fix all other issues but it
> still is better than current situation. IMHO we should keep fixing
> kmem bit by bit.
> 
> One crazy idea is to just break it completely by force charging all the time.

What is the limit good for then? Accounting?
Shakeel Butt May 28, 2018, 5:23 p.m. UTC | #4
On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Sat 26-05-18 15:37:05, Shakeel Butt wrote:
>> On Sat, May 26, 2018 at 11:51 AM, Vladimir Davydov
>> <vdavydov.dev@gmail.com> wrote:
>> > On Fri, May 25, 2018 at 11:55:01AM -0700, Shakeel Butt wrote:
>> >> Based on several conditions the kernel can decide to force charge an
>> >> allocation for a memcg i.e. overcharge memcg->memory and memcg->memsw
>> >> counters. Do the same for memcg->kmem counter too. In cgroup-v1, this
>> >> bug can cause a __GFP_NOFAIL kmem allocation fail if an explicit limit
>> >> on kmem counter is set and reached.
>> >
>> > memory.kmem.limit is broken and unlikely to ever be fixed as this knob
>> > was deprecated in cgroup-v2. The fact that hitting the limit doesn't
>> > trigger reclaim can result in unexpected behavior from user's pov, like
>> > getting ENOMEM while listing a directory. Bypassing the limit for NOFAIL
>> > allocations isn't going to fix those problem.
>>
>> I understand that fixing NOFAIL will not fix all other issues but it
>> still is better than current situation. IMHO we should keep fixing
>> kmem bit by bit.
>>
>> One crazy idea is to just break it completely by force charging all the time.
>
> What is the limit good for then? Accounting?
>

Unlike tcpmem, the kmem accounting is enabled by default. No need to
set the limit to enable accounting.

I think my crazy idea was just wrong and without much thought. Though
is there a precedence where the broken feature is not fixed because an
alternative is available?
Michal Hocko May 29, 2018, 8:31 a.m. UTC | #5
On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> Though is there a precedence where the broken feature is not fixed
> because an alternative is available?

Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
other hand we keep saying that kmem accounting in v1 is hard usable and
strongly discourage people from using it. Sure we can add the code which
handles _this_ particular case but that wouldn't make the whole thing
more usable I strongly suspect. Maybe I am wrong and you can provide
some specific examples. Is GFP_NOFAIL that common to matter?

In any case we should balance between the code maintainability here.
Adding more cruft into the allocator path is not free.
Shakeel Butt May 30, 2018, 6:14 p.m. UTC | #6
On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
>> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
>> Though is there a precedence where the broken feature is not fixed
>> because an alternative is available?
>
> Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> other hand we keep saying that kmem accounting in v1 is hard usable and
> strongly discourage people from using it. Sure we can add the code which
> handles _this_ particular case but that wouldn't make the whole thing
> more usable I strongly suspect. Maybe I am wrong and you can provide
> some specific examples. Is GFP_NOFAIL that common to matter?
>
> In any case we should balance between the code maintainability here.
> Adding more cruft into the allocator path is not free.
>

We do not use kmem limits internally and this is something I found
through code inspection. If this patch is increasing the cost of code
maintainability I am fine with dropping it but at least there should a
comment saying that kmem limits are broken and no need fix.

Shakeel
Minchan Kim May 31, 2018, 6:01 a.m. UTC | #7
On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >> Though is there a precedence where the broken feature is not fixed
> >> because an alternative is available?
> >
> > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > other hand we keep saying that kmem accounting in v1 is hard usable and
> > strongly discourage people from using it. Sure we can add the code which
> > handles _this_ particular case but that wouldn't make the whole thing
> > more usable I strongly suspect. Maybe I am wrong and you can provide
> > some specific examples. Is GFP_NOFAIL that common to matter?
> >
> > In any case we should balance between the code maintainability here.
> > Adding more cruft into the allocator path is not free.
> >
> 
> We do not use kmem limits internally and this is something I found
> through code inspection. If this patch is increasing the cost of code
> maintainability I am fine with dropping it but at least there should a
> comment saying that kmem limits are broken and no need fix.
 
I agree.

Even, I didn't know kmem is strongly discouraged until now. Then,
why is it enabled by default on cgroup v1?

Let's turn if off with comment "It's broken so do not use/fix. Instead,
please move to cgroup v2".
Michal Hocko May 31, 2018, 6:56 a.m. UTC | #8
On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > >> Though is there a precedence where the broken feature is not fixed
> > >> because an alternative is available?
> > >
> > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > strongly discourage people from using it. Sure we can add the code which
> > > handles _this_ particular case but that wouldn't make the whole thing
> > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > some specific examples. Is GFP_NOFAIL that common to matter?
> > >
> > > In any case we should balance between the code maintainability here.
> > > Adding more cruft into the allocator path is not free.
> > >
> > 
> > We do not use kmem limits internally and this is something I found
> > through code inspection. If this patch is increasing the cost of code
> > maintainability I am fine with dropping it but at least there should a
> > comment saying that kmem limits are broken and no need fix.
>  
> I agree.
> 
> Even, I didn't know kmem is strongly discouraged until now. Then,
> why is it enabled by default on cgroup v1?

You have to set a non-zero limit to make it active IIRC. The code is
compiled in because v2 enables it by default.

> Let's turn if off with comment "It's broken so do not use/fix. Instead,
> please move to cgroup v2".
Minchan Kim May 31, 2018, 8:23 a.m. UTC | #9
On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote:
> On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > >> Though is there a precedence where the broken feature is not fixed
> > > >> because an alternative is available?
> > > >
> > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > > strongly discourage people from using it. Sure we can add the code which
> > > > handles _this_ particular case but that wouldn't make the whole thing
> > > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > > some specific examples. Is GFP_NOFAIL that common to matter?
> > > >
> > > > In any case we should balance between the code maintainability here.
> > > > Adding more cruft into the allocator path is not free.
> > > >
> > > 
> > > We do not use kmem limits internally and this is something I found
> > > through code inspection. If this patch is increasing the cost of code
> > > maintainability I am fine with dropping it but at least there should a
> > > comment saying that kmem limits are broken and no need fix.
> >  
> > I agree.
> > 
> > Even, I didn't know kmem is strongly discouraged until now. Then,
> > why is it enabled by default on cgroup v1?
> 
> You have to set a non-zero limit to make it active IIRC. The code is

Maybe, no. I didn't set anything. IOW, it was a default without any setting
and I hit this as you can remember.
http://lkml.kernel.org/r/<20180418022912.248417-1-minchan@kernel.org>
We don't need to allocate memory for stuff even maintainers discourage.

> compiled in because v2 enables it by default.
> 
> > Let's turn if off with comment "It's broken so do not use/fix. Instead,
> > please move to cgroup v2".
> 
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 31, 2018, 8:51 a.m. UTC | #10
On Thu 31-05-18 17:23:17, Minchan Kim wrote:
> On Thu, May 31, 2018 at 08:56:42AM +0200, Michal Hocko wrote:
> > On Thu 31-05-18 15:01:33, Minchan Kim wrote:
> > > On Wed, May 30, 2018 at 11:14:33AM -0700, Shakeel Butt wrote:
> > > > On Tue, May 29, 2018 at 1:31 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > > On Mon 28-05-18 10:23:07, Shakeel Butt wrote:
> > > > >> On Mon, May 28, 2018 at 2:11 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > >> Though is there a precedence where the broken feature is not fixed
> > > > >> because an alternative is available?
> > > > >
> > > > > Well, I can see how breaking GFP_NOFAIL semantic is problematic, on the
> > > > > other hand we keep saying that kmem accounting in v1 is hard usable and
> > > > > strongly discourage people from using it. Sure we can add the code which
> > > > > handles _this_ particular case but that wouldn't make the whole thing
> > > > > more usable I strongly suspect. Maybe I am wrong and you can provide
> > > > > some specific examples. Is GFP_NOFAIL that common to matter?
> > > > >
> > > > > In any case we should balance between the code maintainability here.
> > > > > Adding more cruft into the allocator path is not free.
> > > > >
> > > > 
> > > > We do not use kmem limits internally and this is something I found
> > > > through code inspection. If this patch is increasing the cost of code
> > > > maintainability I am fine with dropping it but at least there should a
> > > > comment saying that kmem limits are broken and no need fix.
> > >  
> > > I agree.
> > > 
> > > Even, I didn't know kmem is strongly discouraged until now. Then,
> > > why is it enabled by default on cgroup v1?
> > 
> > You have to set a non-zero limit to make it active IIRC. The code is
> 
> Maybe, no. I didn't set anything. IOW, it was a default without any setting
> and I hit this as you can remember.
> http://lkml.kernel.org/r/<20180418022912.248417-1-minchan@kernel.org>
> We don't need to allocate memory for stuff even maintainers discourage.

Well, you can disable the whole kmem accounting by cgroup.memory=nokmem
kernel parameter. I wasn't very happy when we removed the config option
which would have put all the tracking aside even when no limit is set
but the argument was that the ifdefery was growing into an
unmaintainable mess and I have to agree with that. So here we are...

Patch
diff mbox

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab5673dbfc4e..0a88f824c550 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1893,6 +1893,18 @@  void mem_cgroup_handle_over_high(void)
 	current->memcg_nr_pages_over_high = 0;
 }
 
+/*
+ * Based on try_charge() force charge conditions.
+ */
+static inline bool should_force_charge(gfp_t gfp_mask)
+{
+	return (unlikely(tsk_is_oom_victim(current) ||
+			 fatal_signal_pending(current) ||
+			 current->flags & PF_EXITING ||
+			 current->flags & PF_MEMALLOC ||
+			 gfp_mask & __GFP_NOFAIL));
+}
+
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		      unsigned int nr_pages)
 {
@@ -2008,6 +2020,8 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * The allocation either can't fail or will lead to more memory
 	 * being freed very soon.  Allow memory usage go over the limit
 	 * temporarily by force charging it.
+	 *
+	 * NOTE: Please keep the should_force_charge() conditions in sync.
 	 */
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
@@ -2331,8 +2345,11 @@  int memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
-		cancel_charge(memcg, nr_pages);
-		return -ENOMEM;
+		if (!should_force_charge(gfp)) {
+			cancel_charge(memcg, nr_pages);
+			return -ENOMEM;
+		}
+		page_counter_charge(&memcg->kmem, nr_pages);
 	}
 
 	page->mem_cgroup = memcg;