[3/3] mm: memcontrol: deprecate charge moving

Message ID 20221206171340.139790-4-hannes@cmpxchg.org (mailing list archive)
State New
Series mm: push down lock_page_memcg()

Commit Message

Johannes Weiner Dec. 6, 2022, 5:13 p.m. UTC
Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive. Pages need to be identified, locked and
isolated from various MM operations, and reassigned, one by one.

Second, it's unreliable. Once pages are charged to a cgroup, there
isn't always a clear owner task anymore. Cache isn't moved at all, for
example. Mapped memory is moved - but if trylocking or isolating a
page fails, it's arbitrarily left behind. Frequent moving between
domains may leave a task's memory scattered all over the place.

Third, it isn't really needed. Launcher tasks can kick off workload
tasks directly in their target cgroup. Using dedicated per-workload
groups allows fine-grained policy adjustments - no need to move tasks
and their physical pages between control domains. The feature was
never forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of
supporting it is enormous. Because pages are moved while they are live
and subject to various MM operations, the synchronization rules are
complicated. There are lock_page_memcg() in MM and FS code, which
non-cgroup people don't understand. In some cases we've been able to
shift code and cgroup API calls around such that we can rely on native
locking as much as possible. But that's fragile, and sometimes we need
to hold MM locks for longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
 mm/memcontrol.c                                |  4 ++++
 2 files changed, 14 insertions(+), 1 deletion(-)
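
The commit message's third point, that launcher tasks can start workload tasks directly in their target cgroup, could be sketched in userspace roughly as follows. This is a minimal illustration and not part of the patch: the helper names `join_cgroup()` and `launch_in_cgroup()` are hypothetical, and the caller is assumed to pass the path of an existing cgroup's `cgroup.procs` file. The child joins the cgroup before exec, so the workload's memory is charged to the target group from the start and nothing needs to be moved later.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the calling process's PID into the given
 * cgroup.procs file. Returns 0 on success, -1 on failure.
 */
int join_cgroup(const char *procs_path)
{
	FILE *f = fopen(procs_path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%ld\n", (long)getpid()) > 0;
	/* The buffered write reaches the file at fclose(); check it too. */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: fork, have the child join the target cgroup,
 * then exec the workload. Returns the child's PID in the parent,
 * or -1 if fork() failed.
 */
pid_t launch_in_cgroup(const char *procs_path, char *const argv[])
{
	pid_t pid = fork();

	if (pid != 0)
		return pid;	/* parent, or fork failure (-1) */

	/* Child: enter the cgroup first, so every later charge lands there. */
	if (join_cgroup(procs_path) != 0)
		_exit(126);	/* arbitrary exit code chosen for this sketch */

	execvp(argv[0], argv);
	_exit(127);		/* only reached if exec failed */
}
```

A launcher might call `launch_in_cgroup("/sys/fs/cgroup/memory/workload-a/cgroup.procs", argv)`. That path is only an example; it assumes the cgroup1 memory controller is mounted at `/sys/fs/cgroup/memory` and that a group named `workload-a` has already been created.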

Comments

Shakeel Butt Dec. 7, 2022, 12:03 a.m. UTC | #1
On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Charge moving mode in cgroup1 allows memory to follow tasks as they
> migrate between cgroups. This is, and always has been, a questionable
> thing to do - for several reasons.
>
> First, it's expensive. Pages need to be identified, locked and
> isolated from various MM operations, and reassigned, one by one.
>
> Second, it's unreliable. Once pages are charged to a cgroup, there
> isn't always a clear owner task anymore. Cache isn't moved at all, for
> example. Mapped memory is moved - but if trylocking or isolating a
> page fails, it's arbitrarily left behind. Frequent moving between
> domains may leave a task's memory scattered all over the place.
>
> Third, it isn't really needed. Launcher tasks can kick off workload
> tasks directly in their target cgroup. Using dedicated per-workload
> groups allows fine-grained policy adjustments - no need to move tasks
> and their physical pages between control domains. The feature was
> never forward-ported to cgroup2, and it hasn't been missed.
>
> Despite it being a niche usecase, the maintenance overhead of
> supporting it is enormous. Because pages are moved while they are live
> and subject to various MM operations, the synchronization rules are
> complicated. There are lock_page_memcg() in MM and FS code, which
> non-cgroup people don't understand. In some cases we've been able to
> shift code and cgroup API calls around such that we can rely on native
> locking as much as possible. But that's fragile, and sometimes we need
> to hold MM locks for longer than we otherwise would (pte lock e.g.).
>
> Mark the feature deprecated. Hopefully we can remove it soon.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Shakeel Butt <shakeelb@google.com>

I would request this patch to be backported to stable kernels as well
for early warnings to users which update to newer kernels very late.
Hugh Dickins Dec. 7, 2022, 1:58 a.m. UTC | #2
On Tue, 6 Dec 2022, Johannes Weiner wrote:

> Charge moving mode in cgroup1 allows memory to follow tasks as they
> migrate between cgroups. This is, and always has been, a questionable
> thing to do - for several reasons.
> 
> First, it's expensive. Pages need to be identified, locked and
> isolated from various MM operations, and reassigned, one by one.
> 
> Second, it's unreliable. Once pages are charged to a cgroup, there
> isn't always a clear owner task anymore. Cache isn't moved at all, for
> example. Mapped memory is moved - but if trylocking or isolating a
> page fails, it's arbitrarily left behind. Frequent moving between
> domains may leave a task's memory scattered all over the place.
> 
> Third, it isn't really needed. Launcher tasks can kick off workload
> tasks directly in their target cgroup. Using dedicated per-workload
> groups allows fine-grained policy adjustments - no need to move tasks
> and their physical pages between control domains. The feature was
> never forward-ported to cgroup2, and it hasn't been missed.
> 
> Despite it being a niche usecase, the maintenance overhead of
> supporting it is enormous. Because pages are moved while they are live
> and subject to various MM operations, the synchronization rules are
> complicated. There are lock_page_memcg() in MM and FS code, which
> non-cgroup people don't understand. In some cases we've been able to
> shift code and cgroup API calls around such that we can rely on native
> locking as much as possible. But that's fragile, and sometimes we need
> to hold MM locks for longer than we otherwise would (pte lock e.g.).
> 
> Mark the feature deprecated. Hopefully we can remove it soon.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Hugh Dickins <hughd@google.com>

but I wonder if it would be helpful to mention move_charge_at_immigrate
in the deprecation message: maybe the first line should be
"Cgroup memory moving (move_charge_at_immigrate) is deprecated.\n"

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
>  mm/memcontrol.c                                |  4 ++++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 60370f2c67b9..87d7877b98ec 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -86,6 +86,8 @@ Brief summary of control files.
>   memory.swappiness		     set/show swappiness parameter of vmscan
>  				     (See sysctl's vm.swappiness)
>   memory.move_charge_at_immigrate     set/show controls of moving charges
> +                                     This knob is deprecated and shouldn't be
> +                                     used.
>   memory.oom_control		     set/show oom controls.
>   memory.numa_stat		     show the number of memory usage per numa
>  				     node
> @@ -717,9 +719,16 @@ Soft limits can be setup by using the following commands (in this example we
>         It is recommended to set the soft limit always below the hard limit,
>         otherwise the hard limit will take precedence.
>  
> -8. Move charges at task migration
> +8. Move charges at task migration (DEPRECATED!)
>  =================================
>  
> +THIS IS DEPRECATED!
> +
> +It's expensive and unreliable! It's better practice to launch workload
> +tasks directly from inside their target cgroup. Use dedicated workload
> +cgroups to allow fine-grained policy adjustments without having to
> +move physical pages between control domains.
> +
>  Users can move charges associated with a task along with task migration, that
>  is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
>  This feature is not supported in !CONFIG_MMU environments because of lack of
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b696354c1b21..e650a38d9a90 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3919,6 +3919,10 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  
> +	pr_warn_once("Cgroup memory moving is deprecated. "
> +		     "Please report your usecase to linux-mm@kvack.org if you "
> +		     "depend on this functionality.\n");
> +
>  	if (val & ~MOVE_MASK)
>  		return -EINVAL;
>  
> -- 
> 2.38.1
> 
>
Johannes Weiner Dec. 7, 2022, 1 p.m. UTC | #3
On Tue, Dec 06, 2022 at 05:58:14PM -0800, Hugh Dickins wrote:
> On Tue, 6 Dec 2022, Johannes Weiner wrote:
> 
> > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > migrate between cgroups. This is, and always has been, a questionable
> > thing to do - for several reasons.
> > 
> > First, it's expensive. Pages need to be identified, locked and
> > isolated from various MM operations, and reassigned, one by one.
> > 
> > Second, it's unreliable. Once pages are charged to a cgroup, there
> > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > example. Mapped memory is moved - but if trylocking or isolating a
> > page fails, it's arbitrarily left behind. Frequent moving between
> > domains may leave a task's memory scattered all over the place.
> > 
> > Third, it isn't really needed. Launcher tasks can kick off workload
> > tasks directly in their target cgroup. Using dedicated per-workload
> > groups allows fine-grained policy adjustments - no need to move tasks
> > and their physical pages between control domains. The feature was
> > never forward-ported to cgroup2, and it hasn't been missed.
> > 
> > Despite it being a niche usecase, the maintenance overhead of
> > supporting it is enormous. Because pages are moved while they are live
> > and subject to various MM operations, the synchronization rules are
> > complicated. There are lock_page_memcg() in MM and FS code, which
> > non-cgroup people don't understand. In some cases we've been able to
> > shift code and cgroup API calls around such that we can rely on native
> > locking as much as possible. But that's fragile, and sometimes we need
> > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> > 
> > Mark the feature deprecated. Hopefully we can remove it soon.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Hugh Dickins <hughd@google.com>

Thanks

> but I wonder if it would be helpful to mention move_charge_at_immigrate
> in the deprecation message: maybe the first line should be
> "Cgroup memory moving (move_charge_at_immigrate) is deprecated.\n"

Fair enough! Here is the updated patch.

---

From 0e791e6ab8ba2f75dd4205684c06bcc7308d9867 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 5 Dec 2022 19:57:06 +0100
Subject: [PATCH] mm: memcontrol: deprecate charge moving

Charge moving mode in cgroup1 allows memory to follow tasks as they
migrate between cgroups. This is, and always has been, a questionable
thing to do - for several reasons.

First, it's expensive. Pages need to be identified, locked and
isolated from various MM operations, and reassigned, one by one.

Second, it's unreliable. Once pages are charged to a cgroup, there
isn't always a clear owner task anymore. Cache isn't moved at all, for
example. Mapped memory is moved - but if trylocking or isolating a
page fails, it's arbitrarily left behind. Frequent moving between
domains may leave a task's memory scattered all over the place.

Third, it isn't really needed. Launcher tasks can kick off workload
tasks directly in their target cgroup. Using dedicated per-workload
groups allows fine-grained policy adjustments - no need to move tasks
and their physical pages between control domains. The feature was
never forward-ported to cgroup2, and it hasn't been missed.

Despite it being a niche usecase, the maintenance overhead of
supporting it is enormous. Because pages are moved while they are live
and subject to various MM operations, the synchronization rules are
complicated. There are lock_page_memcg() in MM and FS code, which
non-cgroup people don't understand. In some cases we've been able to
shift code and cgroup API calls around such that we can rely on native
locking as much as possible. But that's fragile, and sometimes we need
to hold MM locks for longer than we otherwise would (pte lock e.g.).

Mark the feature deprecated. Hopefully we can remove it soon.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 11 ++++++++++-
 mm/memcontrol.c                                |  4 ++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 60370f2c67b9..87d7877b98ec 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -86,6 +86,8 @@ Brief summary of control files.
  memory.swappiness		     set/show swappiness parameter of vmscan
 				     (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate     set/show controls of moving charges
+                                     This knob is deprecated and shouldn't be
+                                     used.
  memory.oom_control		     set/show oom controls.
  memory.numa_stat		     show the number of memory usage per numa
 				     node
@@ -717,9 +719,16 @@ Soft limits can be setup by using the following commands (in this example we
        It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. Move charges at task migration
+8. Move charges at task migration (DEPRECATED!)
 =================================
 
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
+
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b696354c1b21..9c9a42153b76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3919,6 +3919,10 @@ static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
 	if (val & ~MOVE_MASK)
 		return -EINVAL;
Andrew Morton Dec. 7, 2022, 9:51 p.m. UTC | #4
On Tue, 6 Dec 2022 16:03:54 -0800 Shakeel Butt <shakeelb@google.com> wrote:

> On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > migrate between cgroups. This is, and always has been, a questionable
> > thing to do - for several reasons.
> >
> > First, it's expensive. Pages need to be identified, locked and
> > isolated from various MM operations, and reassigned, one by one.
> >
> > Second, it's unreliable. Once pages are charged to a cgroup, there
> > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > example. Mapped memory is moved - but if trylocking or isolating a
> > page fails, it's arbitrarily left behind. Frequent moving between
> > domains may leave a task's memory scattered all over the place.
> >
> > Third, it isn't really needed. Launcher tasks can kick off workload
> > tasks directly in their target cgroup. Using dedicated per-workload
> > groups allows fine-grained policy adjustments - no need to move tasks
> > and their physical pages between control domains. The feature was
> > never forward-ported to cgroup2, and it hasn't been missed.
> >
> > Despite it being a niche usecase, the maintenance overhead of
> > supporting it is enormous. Because pages are moved while they are live
> > and subject to various MM operations, the synchronization rules are
> > complicated. There are lock_page_memcg() in MM and FS code, which
> > non-cgroup people don't understand. In some cases we've been able to
> > shift code and cgroup API calls around such that we can rely on native
> > locking as much as possible. But that's fragile, and sometimes we need
> > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> >
> > Mark the feature deprecated. Hopefully we can remove it soon.
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Acked-by: Shakeel Butt <shakeelb@google.com>
> 
> I would request this patch to be backported to stable kernels as well
> for early warnings to users which update to newer kernels very late.

Sounds reasonable, but the changelog should have a few words in it
explaining why we're requesting the backport.  I guess I can type those
in.

We're at -rc8 and I'm not planning on merging these up until after
6.2-rc1 is out.  Please feel free to argue with me on that score.
Shakeel Butt Dec. 7, 2022, 10:15 p.m. UTC | #5
On Wed, Dec 7, 2022 at 1:51 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 6 Dec 2022 16:03:54 -0800 Shakeel Butt <shakeelb@google.com> wrote:
>
> > On Tue, Dec 6, 2022 at 9:14 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Charge moving mode in cgroup1 allows memory to follow tasks as they
> > > migrate between cgroups. This is, and always has been, a questionable
> > > thing to do - for several reasons.
> > >
> > > First, it's expensive. Pages need to be identified, locked and
> > > isolated from various MM operations, and reassigned, one by one.
> > >
> > > Second, it's unreliable. Once pages are charged to a cgroup, there
> > > isn't always a clear owner task anymore. Cache isn't moved at all, for
> > > example. Mapped memory is moved - but if trylocking or isolating a
> > > page fails, it's arbitrarily left behind. Frequent moving between
> > > domains may leave a task's memory scattered all over the place.
> > >
> > > Third, it isn't really needed. Launcher tasks can kick off workload
> > > tasks directly in their target cgroup. Using dedicated per-workload
> > > groups allows fine-grained policy adjustments - no need to move tasks
> > > and their physical pages between control domains. The feature was
> > > never forward-ported to cgroup2, and it hasn't been missed.
> > >
> > > Despite it being a niche usecase, the maintenance overhead of
> > > supporting it is enormous. Because pages are moved while they are live
> > > and subject to various MM operations, the synchronization rules are
> > > complicated. There are lock_page_memcg() in MM and FS code, which
> > > non-cgroup people don't understand. In some cases we've been able to
> > > shift code and cgroup API calls around such that we can rely on native
> > > locking as much as possible. But that's fragile, and sometimes we need
> > > to hold MM locks for longer than we otherwise would (pte lock e.g.).
> > >
> > > Mark the feature deprecated. Hopefully we can remove it soon.
> > >
> > > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> >
> > Acked-by: Shakeel Butt <shakeelb@google.com>
> >
> > I would request this patch to be backported to stable kernels as well
> > for early warnings to users which update to newer kernels very late.
>
> Sounds reasonable, but the changelog should have a few words in it
> explaining why we're requesting the backport.  I guess I can type those
> in.

Thanks a lot.

>
> We're at -rc8 and I'm not planning on merging these up until after
> 6.2-rc1 is out.  Please feel free to argue with me on that score.

No, I totally agree with you. There is no such urgency in merging
these and a couple of weeks delay is totally fine.

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 60370f2c67b9..87d7877b98ec 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -86,6 +86,8 @@  Brief summary of control files.
  memory.swappiness		     set/show swappiness parameter of vmscan
 				     (See sysctl's vm.swappiness)
  memory.move_charge_at_immigrate     set/show controls of moving charges
+                                     This knob is deprecated and shouldn't be
+                                     used.
  memory.oom_control		     set/show oom controls.
  memory.numa_stat		     show the number of memory usage per numa
 				     node
@@ -717,9 +719,16 @@  Soft limits can be setup by using the following commands (in this example we
        It is recommended to set the soft limit always below the hard limit,
        otherwise the hard limit will take precedence.
 
-8. Move charges at task migration
+8. Move charges at task migration (DEPRECATED!)
 =================================
 
+THIS IS DEPRECATED!
+
+It's expensive and unreliable! It's better practice to launch workload
+tasks directly from inside their target cgroup. Use dedicated workload
+cgroups to allow fine-grained policy adjustments without having to
+move physical pages between control domains.
+
 Users can move charges associated with a task along with task migration, that
 is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
 This feature is not supported in !CONFIG_MMU environments because of lack of
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b696354c1b21..e650a38d9a90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3919,6 +3919,10 @@  static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	pr_warn_once("Cgroup memory moving is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
 	if (val & ~MOVE_MASK)
 		return -EINVAL;
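
The behaviour of the patched write handler can be approximated in userspace as below. This is a hedged sketch, not kernel code: `warn_once()` is a stand-in for `pr_warn_once()`, `move_charge_write()` and `warnings_printed` are hypothetical names, and the `MOVE_ANON`/`MOVE_FILE` values are assumptions, since the patch does not show how `MOVE_MASK` is defined. The sketch uses the message text from the updated version of the patch posted in reply #3.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed flag values; the patch itself does not show these definitions. */
#define MOVE_ANON 0x1UL
#define MOVE_FILE 0x2UL
#define MOVE_MASK (MOVE_ANON | MOVE_FILE)

/* Counts emitted warnings so the once-only behaviour can be observed. */
unsigned int warnings_printed;

/* Userspace stand-in for pr_warn_once(): prints at most once per call site. */
#define warn_once(msg)					\
	do {						\
		static bool warned;			\
		if (!warned) {				\
			warned = true;			\
			warnings_printed++;		\
			fputs(msg, stderr);		\
		}					\
	} while (0)

/*
 * Hypothetical analogue of the patched handler: warn first, then
 * validate the value against the mask, then store it.
 */
int move_charge_write(unsigned long *flags, unsigned long val)
{
	warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
		  "Please report your usecase to linux-mm@kvack.org if you "
		  "depend on this functionality.\n");

	if (val & ~MOVE_MASK)
		return -EINVAL;

	*flags = val;
	return 0;
}
```

Because the warning precedes the mask check, as in the patch, any write to the knob triggers it once, including writes that are rejected with `-EINVAL` and writes of 0 that disable charge moving.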