
[v5,2/6] mm/memcg: Disable threshold event handlers on PREEMPT_RT

Message ID 20220226204144.1008339-3-bigeasy@linutronix.de
State New
Series mm/memcg: Address PREEMPT_RT problems instead of disabling it.

Commit Message

Sebastian Andrzej Siewior Feb. 26, 2022, 8:41 p.m. UTC
During the integration of PREEMPT_RT support, the code flow around
memcg_check_events() resulted in `twisted code'. Moving the code around
to avoid that would in turn add a local-irq-save section within
memcg_check_events(). While looking better, it adds a local-irq-save
section to a code flow which is usually within a local-irq-off block on
non-PREEMPT_RT configurations.

The threshold event handler is a deprecated memcg v1 feature. Instead of
trying to get it to work under PREEMPT_RT, just disable it. There should
be no users on PREEMPT_RT, and getting a feature with zero users to work
there makes even less sense.

Make writes to memory.soft_limit_in_bytes and cgroup.event_control fail
with -EOPNOTSUPP on PREEMPT_RT, turn memcg_check_events() into a no-op,
and document that the two knobs are disabled on PREEMPT_RT.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/cgroup-v1/memory.rst |  2 ++
 mm/memcontrol.c                                | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)
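
For illustration, a minimal userspace sketch of the resulting behaviour
on PREEMPT_RT; the cgroup path below is an example and assumes a mounted
cgroup v1 memory controller:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Example path; adjust to the actual v1 memory controller mount. */
	int fd = open("/sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes",
		      O_WRONLY);
	const char *val = "104857600";

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* On PREEMPT_RT this write is expected to fail with EOPNOTSUPP. */
	if (write(fd, val, strlen(val)) < 0 && errno == EOPNOTSUPP)
		printf("soft_limit_in_bytes is disabled on this kernel\n");
	close(fd);
	return 0;
}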

Comments

Valentin Schneider March 1, 2023, 6:23 p.m. UTC | #1
On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
> During the integration of PREEMPT_RT support, the code flow around
> memcg_check_events() resulted in `twisted code'. Moving the code around
> to avoid that would in turn add a local-irq-save section within
> memcg_check_events(). While looking better, it adds a local-irq-save
> section to a code flow which is usually within a local-irq-off block on
> non-PREEMPT_RT configurations.
>

Hey, sorry for necro'ing a year-old thread - would you happen to remember
what the issues were with memcg_check_events()? I ran tests against
cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
detect anything; I'm guessing it has to do with the IRQ-off region
memcg_check_events() is called from?

I want cgroupv1 to die as much as the next person, but in that specific
situation I kinda need cgroupv1 to behave somewhat sanely on RT with
threshold events :/

Cheers
Michal Hocko March 2, 2023, 7:45 a.m. UTC | #2
On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
> > During the integration of PREEMPT_RT support, the code flow around
> > memcg_check_events() resulted in `twisted code'. Moving the code around
> > to avoid that would in turn add a local-irq-save section within
> > memcg_check_events(). While looking better, it adds a local-irq-save
> > section to a code flow which is usually within a local-irq-off block on
> > non-PREEMPT_RT configurations.
> >
> 
> Hey, sorry for necro'ing a year-old thread - would you happen to remember
> what the issues were with memcg_check_events()? I ran tests against
> cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
> detect anything; I'm guessing it has to do with the IRQ-off region
> memcg_check_events() is called from?

I would have to look into the details, but IIRC the resulting code to
make it RT safe was dreadful and hard to maintain. As we didn't really
have any real-life usecase, disabling the code was an easier way
forward. So it is not that the code would be impossible to enable for
RT; it just doesn't seem to be worth all the complexity.

> I want cgroupv1 to die as much as the next person, but in that specific
> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> threshold events :/

Could you expand on the usecase?
Valentin Schneider March 2, 2023, 10:18 a.m. UTC | #3
On 02/03/23 08:45, Michal Hocko wrote:
> On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
>> On 26/02/22 21:41, Sebastian Andrzej Siewior wrote:
>> > During the integration of PREEMPT_RT support, the code flow around
>> > memcg_check_events() resulted in `twisted code'. Moving the code around
>> > to avoid that would in turn add a local-irq-save section within
>> > memcg_check_events(). While looking better, it adds a local-irq-save
>> > section to a code flow which is usually within a local-irq-off block on
>> > non-PREEMPT_RT configurations.
>> >
>>
>> Hey, sorry for necro'ing a year-old thread - would you happen to remember
>> what the issues were with memcg_check_events()? I ran tests against
>> cgroupv1 using an eventfd on OOM with the usual debug arsenal and didn't
>> detect anything; I'm guessing it has to do with the IRQ-off region
>> memcg_check_events() is called from?
>
> I would have to look into the details, but IIRC the resulting code to
> make it RT safe was dreadful and hard to maintain. As we didn't really
> have any real-life usecase, disabling the code was an easier way
> forward. So it is not that the code would be impossible to enable for
> RT; it just doesn't seem to be worth all the complexity.
>

Right, thanks for having a look.

>> I want cgroupv1 to die as much as the next person, but in that specific
>> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> threshold events :/
>
> Could you expand on the usecase?
>

In this case it's just some middleware leveraging memcontrol cgroups and
setting up callbacks for in-cgroup OOM events. This is a supported feature
in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
parity, but rather one of being in a transitional phase where the
middleware itself hasn't fully migrated to using cgroupv2.
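
For reference, a minimal sketch of how such a v1 OOM listener is
typically wired up (the cgroup path and name are examples):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	uint64_t cnt;
	char buf[32];
	int efd = eventfd(0, 0);
	int ofd = open("/sys/fs/cgroup/memory/app/memory.oom_control", O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/app/cgroup.event_control", O_WRONLY);

	if (efd < 0 || ofd < 0 || cfd < 0)
		return 1;

	/* "<eventfd> <fd of memory.oom_control>" registers the listener. */
	int n = snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(cfd, buf, n) < 0)	/* fails with EOPNOTSUPP on PREEMPT_RT */
		return 1;

	/* Blocks until an in-cgroup OOM event fires. */
	if (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt))
		printf("OOM events: %llu\n", (unsigned long long)cnt);
	return 0;
}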

Michal Hocko March 2, 2023, 11:24 a.m. UTC | #4
On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
> On 02/03/23 08:45, Michal Hocko wrote:
> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
[...]
> >> I want cgroupv1 to die as much as the next person, but in that specific
> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> >> threshold events :/
> >
> > Could you expand on the usecase?
> >
> 
> In this case it's just some middleware leveraging memcontrol cgroups and
> setting up callbacks for in-cgroup OOM events. This is a supported feature
> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
> parity, but rather one of being in a transitional phase where the
> middleware itself hasn't fully migrated to using cgroupv2.

How is this related to the RT kernel config? memcg OOM vs any RT
assumptions do not really get along well AFAICT.
Valentin Schneider March 2, 2023, 12:30 p.m. UTC | #5
On 02/03/23 12:24, Michal Hocko wrote:
> On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
>> On 02/03/23 08:45, Michal Hocko wrote:
>> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> [...]
>> >> I want cgroupv1 to die as much as the next person, but in that specific
>> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> >> threshold events :/
>> >
>> > Could you expand on the usecase?
>> >
>>
>> In this case it's just some middleware leveraging memcontrol cgroups and
>> setting up callbacks for in-cgroup OOM events. This is a supported feature
>> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
>> parity, but rather one of being in a transitional phase where the
>> middleware itself hasn't fully migrated to using cgroupv2.
>
> How is this related to the RT kernel config? memcg OOM vs any RT
> assumptions do not really get along well AFAICT.
>

Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
regarding memory allocations, or lack thereof), but other non-RT-reliant
tasks on other CPUs come and go, hence the memcg involvement.

Michal Hocko March 2, 2023, 12:56 p.m. UTC | #6
On Thu 02-03-23 12:30:33, Valentin Schneider wrote:
> On 02/03/23 12:24, Michal Hocko wrote:
> > On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
> >> On 02/03/23 08:45, Michal Hocko wrote:
> >> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
> > [...]
> >> >> I want cgroupv1 to die as much as the next person, but in that specific
> >> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
> >> >> threshold events :/
> >> >
> >> > Could you expand on the usecase?
> >> >
> >>
> >> In this case it's just some middleware leveraging memcontrol cgroups and
> >> setting up callbacks for in-cgroup OOM events. This is a supported feature
> >> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
> >> parity, but rather one of being in a transitional phase where the
> >> middleware itself hasn't fully migrated to using cgroupv2.
> >
> > How is this related to the RT kernel config? memcg OOM vs any RT
> > assumptions do not really get along well AFAICT.
> >
> 
> Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
> regarding memory allocations, or lack thereof), but other non-RT-reliant
> tasks on other CPUs come and go, hence the memcg involvement.

So are you suggesting that the RT kernel is used for a mixed bag of
workloads with RT and non-RT assumptions? Is this really a reasonable
and reliable setup?

What I am trying to evaluate here is whether it makes sense to support
and maintain non-trivial code for something that might be working
sub-optimally or even not working properly in some corner cases. The
maintenance is certainly not free.
Valentin Schneider March 2, 2023, 2:34 p.m. UTC | #7
On 02/03/23 13:56, Michal Hocko wrote:
> On Thu 02-03-23 12:30:33, Valentin Schneider wrote:
>> On 02/03/23 12:24, Michal Hocko wrote:
>> > On Thu 02-03-23 10:18:31, Valentin Schneider wrote:
>> >> On 02/03/23 08:45, Michal Hocko wrote:
>> >> > On Wed 01-03-23 18:23:19, Valentin Schneider wrote:
>> > [...]
>> >> >> I want cgroupv1 to die as much as the next person, but in that specific
>> >> >> situation I kinda need cgroupv1 to behave somewhat sanely on RT with
>> >> >> threshold events :/
>> >> >
>> >> > Could you expand on the usecase?
>> >> >
>> >>
>> >> In this case it's just some middleware leveraging memcontrol cgroups and
>> >> setting up callbacks for in-cgroup OOM events. This is a supported feature
>> >> in cgroupv2, so this isn't a problem of cgroupv1 vs cgroupv2 feature
>> >> parity, but rather one of being in a transitional phase where the
>> >> middleware itself hasn't fully migrated to using cgroupv2.
>> >
>> > How is this related to the RT kernel config? memcg OOM vs any RT
>> > assumptions do not really get along well AFAICT.
>> >
>>
>> Yep. AIUI the tasks actually relying on RT guarantees DTRT (at least
>> regarding memory allocations, or lack thereof), but other non-RT-reliant
>> tasks on other CPUs come and go, hence the memcg involvement.
>
> So are you suggesting that the RT kernel is used for a mixed bag of
> workloads with RT and non-RT assumptions? Is this really a reasonable
> and reliable setup?
>

To some extent :-)

> What I am trying to evaluate here is whether it makes sense to support
> and maintain non-trivial code for something that might be working
> sub-optimally or even not working properly in some corner cases. The
> maintenance is certainly not free.

That's also what I'm trying to make sense of. Either way this will be
frankenkernel territory; the cgroupv1 ship has already sailed for
upstream, IMO.
Valentin Schneider March 2, 2023, 7:52 p.m. UTC | #8
On 02/03/23 14:34, Valentin Schneider wrote:
> On 02/03/23 13:56, Michal Hocko wrote:
>> What I am trying to evaluate here is whether it makes sense to support
>> and maintain non-trivial code for something that might be working
>> sub-optimally or even not working properly in some corner cases. The
>> maintenance is certainly not free.
>
> That's also what I'm trying to make sense of. Either way this will be
> frankenkernel territory; the cgroupv1 ship has already sailed for
> upstream, IMO.

So, if someone ends up in a similar situation and considers kludging those
notifications back in:

Don't.


Have a look at:

  memcg_check_events()
  `\
    mem_cgroup_threshold()
    `\
      __mem_cgroup_threshold()
      `\
        eventfd_signal()

Having IRQs off, percpu reads, and the eventfd signal is a nice recipe
for disaster.

The OOM notification sits outside of that, but any other memcg notification
will cause pain.
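
Condensed, the hazard looks roughly like this (a sketch of the chain
above, not verbatim kernel code):

/*
 * Condensed sketch of the chain above. memcg_check_events() is reached
 * from charge/uncharge paths that may run with hard IRQs disabled on
 * !PREEMPT_RT.
 */
static void mem_cgroup_threshold_sketch(struct mem_cgroup *memcg)
{
	struct mem_cgroup_threshold_ary *t;
	unsigned long usage;
	int i;

	rcu_read_lock();
	t = rcu_dereference(memcg->thresholds.primary);
	if (!t)
		goto unlock;

	usage = mem_cgroup_usage(memcg, false);

	/*
	 * Signal every eventfd whose threshold was crossed (the real
	 * code also walks upward and updates current_threshold).
	 * eventfd_signal() takes ctx->wqh.lock, a spinlock_t; on
	 * PREEMPT_RT that is a sleeping rtmutex-based lock, so calling
	 * it with hard IRQs disabled means sleeping in atomic context.
	 */
	for (i = t->current_threshold;
	     i >= 0 && t->entries[i].threshold > usage; i--)
		eventfd_signal(t->entries[i].eventfd, 1);
unlock:
	rcu_read_unlock();
}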

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index faac50149a222..2cc502a75ef64 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -64,6 +64,7 @@  Brief summary of control files.
 				     threads
  cgroup.procs			     show list of processes
  cgroup.event_control		     an interface for event_fd()
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.usage_in_bytes		     show current usage for memory
 				     (See 5.5 for details)
  memory.memsw.usage_in_bytes	     show current usage for memory+Swap
@@ -75,6 +76,7 @@  Brief summary of control files.
  memory.max_usage_in_bytes	     show max memory usage recorded
  memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
  memory.soft_limit_in_bytes	     set/show soft limit of memory usage
+				     This knob is not available on CONFIG_PREEMPT_RT systems.
  memory.stat			     show various statistics
  memory.use_hierarchy		     set/show hierarchical account enabled
                                      This knob is deprecated and shouldn't be
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8ab2dc75e70ec..0b5117ed2ae08 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,6 +859,9 @@  static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  */
 static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return;
+
 	/* threshold event is triggered in finer grain than soft limit */
 	if (unlikely(mem_cgroup_event_ratelimit(memcg,
 						MEM_CGROUP_TARGET_THRESH))) {
@@ -3731,8 +3734,12 @@  static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 		}
 		break;
 	case RES_SOFT_LIMIT:
-		memcg->soft_limit = nr_pages;
-		ret = 0;
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EOPNOTSUPP;
+		} else {
+			memcg->soft_limit = nr_pages;
+			ret = 0;
+		}
 		break;
 	}
 	return ret ?: nbytes;
@@ -4708,6 +4715,9 @@  static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 	char *endp;
 	int ret;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return -EOPNOTSUPP;
+
 	buf = strstrip(buf);
 
 	efd = simple_strtoul(buf, &endp, 10);