[2/2] memcg: do not drain charge pcp caches on remote isolated cpus

Message ID: 20230317134448.11082-3-mhocko@kernel.org (mailing list archive)
State: New
Series: memcg, cpuisol: do not interfere pcp cache charges draining with cpuisol workloads

Commit Message

Michal Hocko March 17, 2023, 1:44 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Leonardo Bras has noticed that pcp charge cache draining might be
disruptive on workloads relying on 'isolated cpus', a feature commonly
used by workloads that are sensitive to interruptions and context
switches, such as vRAN and Industrial Control Systems.

There are essentially two ways to approach the issue. We can either
allow the pcp cache to be drained from a cpu other than the local one,
or avoid remote flushing on isolated cpus.

The current pcp charge cache is heavily optimized for performance and
relies on always operating on its local cpu. That means it only requires
a local_lock (preempt_disable on !RT), and draining is handed over to the
pcp WQ so that it again happens locally.
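
As an illustration, the fast path looks roughly like this (a simplified
sketch of consume_stock() in mm/memcontrol.c; details trimmed):

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	unsigned long flags;
	bool ret = false;

	if (nr_pages > MEMCG_CHARGE_BATCH)
		return ret;

	/* cpu-local protection only, no cross-cpu locking on this path */
	local_lock_irqsave(&memcg_stock.stock_lock, flags);

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		/* plain, non-atomic arithmetic is sufficient */
		stock->nr_pages -= nr_pages;
		ret = true;
	}

	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
	return ret;
}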

The former solution (remote draining) would require additional locking
to prevent local charges from racing with the draining. This adds an
atomic operation to an otherwise simple arithmetic fast path in
try_charge. Another concern is that remote draining could cause lock
contention for the isolated workloads and therefore interfere with them
indirectly via user space interfaces.
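
To illustrate that cost, here is a purely hypothetical sketch (not part
of this series) of what the cpu-local update above would roughly turn
into if a remote drainer were allowed to race with it; nr_pages would
have to become an atomic_t:

	int old = atomic_read(&stock->nr_pages);

	/* hypothetical: atomic RMW replacing the plain subtraction */
	while (memcg == stock->cached && old >= (int)nr_pages) {
		if (atomic_try_cmpxchg(&stock->nr_pages, &old,
				       old - (int)nr_pages)) {
			ret = true;
			break;
		}
	}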

The latter option is to avoid scheduling the draining on isolated cpus
altogether. That means those remote cpus would keep their charges even
after drain_all_stock returns. This is certainly not optimal either, but
it shouldn't cause any major problems. In the worst case (many isolated
cpus with charges, each of them holding up to MEMCG_CHARGE_BATCH, i.e.
64 pages) the memory consumption of a memcg would be artificially higher
than what can be immediately used from other cpus. E.g. 16 isolated cpus
with 4kB pages would pin at most 16 * 64 * 4kB = 4MB of charges per
memcg.

Theoretically the memcg OOM killer could be triggered prematurely. It is
currently not clear whether this is a practical problem though. A tight
memcg limit would be counterproductive for cpu isolated workloads pretty
much by definition, because any memcg limit induced memory reclaim could
break user space timing expectations, as those workloads usually expect
to execute in user space most of the time.

Charges could also be left behind on memcg removal. Any future charge on
those isolated cpus will drain that pcp cache, so this won't be a
permanent leak.

Considering the pros and cons of both approaches, this patch implements
the second option: simply do not schedule remote draining if the target
cpu is isolated. This solution is much simpler. It doesn't add any new
locking and it is more predictable from the user space POV. Should the
premature memcg OOM become a real life problem, we can revisit this
decision.

Cc: Leonardo Brás <leobras@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Leonardo Bras <leobras@redhat.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Shakeel Butt March 17, 2023, 8:08 p.m. UTC | #1
On Fri, Mar 17, 2023 at 6:44 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> [...]
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Shakeel Butt <shakeelb@google.com>
kernel test robot March 17, 2023, 9:51 p.m. UTC | #2
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.3-rc2 next-20230317]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230317134448.11082-3-mhocko%40kernel.org
patch subject: [PATCH 2/2] memcg: do not drain charge pcp caches on remote isolated cpus
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20230318/202303180520.5lKAJwrg-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/b376cfcf40a43276e1280950bb926fdf3521940a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
        git checkout b376cfcf40a43276e1280950bb926fdf3521940a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303180520.5lKAJwrg-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/memcontrol.c:2369:14: error: implicit declaration of function 'cpu_is_isolated' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                           else if (!cpu_is_isolated(cpu))
                                     ^
   1 error generated.


vim +/cpu_is_isolated +2369 mm/memcontrol.c

  2331	
  2332	/*
  2333	 * Drains all per-CPU charge caches for given root_memcg resp. subtree
  2334	 * of the hierarchy under it.
  2335	 */
  2336	static void drain_all_stock(struct mem_cgroup *root_memcg)
  2337	{
  2338		int cpu, curcpu;
  2339	
  2340		/* If someone's already draining, avoid adding running more workers. */
  2341		if (!mutex_trylock(&percpu_charge_mutex))
  2342			return;
  2343		/*
  2344		 * Notify other cpus that system-wide "drain" is running
  2345		 * We do not care about races with the cpu hotplug because cpu down
  2346		 * as well as workers from this path always operate on the local
  2347		 * per-cpu data. CPU up doesn't touch memcg_stock at all.
  2348		 */
  2349		migrate_disable();
  2350		curcpu = smp_processor_id();
  2351		for_each_online_cpu(cpu) {
  2352			struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
  2353			struct mem_cgroup *memcg;
  2354			bool flush = false;
  2355	
  2356			rcu_read_lock();
  2357			memcg = stock->cached;
  2358			if (memcg && stock->nr_pages &&
  2359			    mem_cgroup_is_descendant(memcg, root_memcg))
  2360				flush = true;
  2361			else if (obj_stock_flush_required(stock, root_memcg))
  2362				flush = true;
  2363			rcu_read_unlock();
  2364	
  2365			if (flush &&
  2366			    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
  2367				if (cpu == curcpu)
  2368					drain_local_stock(&stock->work);
> 2369				else if (!cpu_is_isolated(cpu))
  2370					schedule_work_on(cpu, &stock->work);
  2371			}
  2372		}
  2373		migrate_enable();
  2374		mutex_unlock(&percpu_charge_mutex);
  2375	}
  2376
kernel test robot March 17, 2023, 10:22 p.m. UTC | #3
Hi Michal,

I love your patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.3-rc2 next-20230317]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230317134448.11082-3-mhocko%40kernel.org
patch subject: [PATCH 2/2] memcg: do not drain charge pcp caches on remote isolated cpus
config: x86_64-randconfig-a013 (https://download.01.org/0day-ci/archive/20230318/202303180617.7E3aIlHf-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/b376cfcf40a43276e1280950bb926fdf3521940a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
        git checkout b376cfcf40a43276e1280950bb926fdf3521940a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303180617.7E3aIlHf-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/memcontrol.c: In function 'drain_all_stock':
>> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
    2369 |                         else if (!cpu_is_isolated(cpu))
         |                                   ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


Andrew Morton March 17, 2023, 11:32 p.m. UTC | #4
On Sat, 18 Mar 2023 06:22:40 +0800 kernel test robot <lkp@intel.com> wrote:

>    mm/memcontrol.c: In function 'drain_all_stock':
> >> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
>     2369 |                         else if (!cpu_is_isolated(cpu))
>          |                                   ^~~~~~~~~~~~~~~

Thanks.

--- a/mm/memcontrol.c~memcg-do-not-drain-charge-pcp-caches-on-remote-isolated-cpus-fix
+++ a/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/sched/isolation.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
Hillf Danton March 18, 2023, 3:23 a.m. UTC | #5
On 17 Mar 2023 14:44:48 +0100 Michal Hocko <mhocko@suse.com>
> Leonardo Bras has noticed that pcp charge cache draining might be
> disruptive on workloads relying on 'isolated cpus', a feature commonly
> used by workloads that are sensitive to interruptions and context
> switches, such as vRAN and Industrial Control Systems.
>
> [...]
>
> Considering the pros and cons of both approaches, this patch implements
> the second option: simply do not schedule remote draining if the target
> cpu is isolated. This solution is much simpler. It doesn't add any new
> locking and it is more predictable from the user space POV. Should the
> premature memcg OOM become a real life problem, we can revisit this
> decision.

JFYI feel free to take a look at the non-housekeeping CPUs [1].

[1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/
Michal Hocko March 18, 2023, 8:03 a.m. UTC | #6
On Fri 17-03-23 16:32:29, Andrew Morton wrote:
> On Sat, 18 Mar 2023 06:22:40 +0800 kernel test robot <lkp@intel.com> wrote:
> 
> >    mm/memcontrol.c: In function 'drain_all_stock':
> > >> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
> >     2369 |                         else if (!cpu_is_isolated(cpu))
> >          |                                   ^~~~~~~~~~~~~~~
> 
> Thanks.
> 
> --- a/mm/memcontrol.c~memcg-do-not-drain-charge-pcp-caches-on-remote-isolated-cpus-fix
> +++ a/mm/memcontrol.c
> @@ -63,6 +63,7 @@
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> +#include <linux/sched/isolation.h>
>  #include "internal.h"
>  #include <net/sock.h>
>  #include <net/ip.h>

Thanks a lot Andrew!
Michal Hocko March 18, 2023, 8:08 a.m. UTC | #7
On Sat 18-03-23 11:23:50, Hillf Danton wrote:
> On 17 Mar 2023 14:44:48 +0100 Michal Hocko <mhocko@suse.com>
> > [...]
> >
> > Considering the pros and cons of both approaches, this patch implements
> > the second option: simply do not schedule remote draining if the target
> > cpu is isolated. This solution is much simpler. It doesn't add any new
> > locking and it is more predictable from the user space POV. Should the
> > premature memcg OOM become a real life problem, we can revisit this
> > decision.
> 
> JFYI feel free to take a look at the non-housekeeping CPUs [1].
> 
> [1] https://lore.kernel.org/lkml/20230223150624.GA29739@lst.de/

Such an approach would require remote draining, and I hope I have
explained why that is not the preferred way in this case. Other than
that, I do agree with Christoph that a generic approach would be really
nice.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0524add35cae..12559c08d976 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2366,7 +2366,7 @@  static void drain_all_stock(struct mem_cgroup *root_memcg)
 		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-			else
+			else if (!cpu_is_isolated(cpu))
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
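
For reference, cpu_is_isolated() is introduced by patch 1/2 of this
series. As a sketch, it tests the housekeeping cpumasks along these
lines (see that patch for the authoritative version):

static inline bool cpu_is_isolated(int cpu)
{
	/* isolated == excluded from either housekeeping mask */
	return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
	       !housekeeping_test_cpu(cpu, HK_TYPE_WQ);
}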