[v10,0/3] Charge loop device i/o to issuing cgroup

Message ID 20210316153655.500806-1-schatzberg.dan@gmail.com (mailing list archive)

Message

Dan Schatzberg March 16, 2021, 3:36 p.m. UTC
No major changes, just rebasing and resubmitting

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against Linus's branch which now includes Roman Gushchin's
  patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
  (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
  support for setting active memcg. Dropped the patch from this series
  that did the same thing.

Changes since V7:

* Rebased against Linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on a cgroup v2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
    sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -d $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
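
The per-cgroup worker idea described above can be illustrated with a
small userspace model. This is only a sketch, not the driver code: the
names struct worker, struct loop_model and model_queue_cmd() are
hypothetical, the fixed-size table stands in for whatever lookup
structure the driver uses, and the fallback to a shared root queue when
no slot is free is an assumption made for the model.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8

/* Hypothetical stand-in for a per-cgroup worker; not the driver's struct. */
struct worker {
	unsigned long css_id;	/* identifies the issuing cgroup */
	int queued;		/* commands waiting on this worker */
	int in_use;
};

struct loop_model {
	struct worker workers[MAX_WORKERS];
	int rootcg_queued;	/* assumed fallback queue when no slot is free */
};

/*
 * Queue one command on behalf of the cgroup identified by css_id.
 * Each cgroup gets its own worker, so a throttled cgroup only delays
 * its own queue. Returns the worker used, or NULL if the command fell
 * back to the shared root queue.
 */
static struct worker *model_queue_cmd(struct loop_model *lo, unsigned long css_id)
{
	struct worker *free_slot = NULL;

	assert(lo != NULL);

	for (int i = 0; i < MAX_WORKERS; i++) {
		struct worker *w = &lo->workers[i];

		if (w->in_use && w->css_id == css_id) {
			w->queued++;
			return w;
		}
		if (!w->in_use && !free_slot)
			free_slot = w;
	}

	if (!free_slot) {
		lo->rootcg_queued++;
		return NULL;
	}

	free_slot->in_use = 1;
	free_slot->css_id = css_id;
	free_slot->queued = 1;
	return free_slot;
}
```

In this model, two cgroups writing to the same device end up on
different workers, and repeated commands from one cgroup land on the
worker it already owns.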

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c       | 248 ++++++++++++++++++++++++++++++-------
 drivers/block/loop.h       |  15 ++-
 include/linux/memcontrol.h |  11 ++
 kernel/cgroup/cgroup.c     |   1 +
 mm/filemap.c               |   2 +-
 mm/memcontrol.c            |  15 ++-
 mm/shmem.c                 |   4 +-
 7 files changed, 242 insertions(+), 54 deletions(-)

Comments

Jens Axboe March 17, 2021, 10:30 p.m. UTC | #1
On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> No major changes, just rebasing and resubmitting

Applied for 5.13, thanks.

Shakeel Butt March 18, 2021, 3:53 p.m. UTC | #2
On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > No major changes, just rebasing and resubmitting
>
> Applied for 5.13, thanks.
>

I have requested a couple of changes in the patch series. Can this
applied series still be changed, or are new patches required?

Jens Axboe March 18, 2021, 4 p.m. UTC | #3
On 3/18/21 9:53 AM, Shakeel Butt wrote:
> On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
>>> No major changes, just rebasing and resubmitting
>>
>> Applied for 5.13, thanks.
>>
> 
> I have requested a couple of changes in the patch series. Can this
> applied series still be changed, or are new patches required?

I have nothing sitting on top of it for now, so as far as I'm concerned
we can apply a new series instead. Then we can also fold in that fix
from Colin that he posted this morning...

Andrew Morton March 18, 2021, 11:46 p.m. UTC | #4
On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> >>> No major changes, just rebasing and resubmitting
> >>
> >> Applied for 5.13, thanks.
> >>
> > 
> > I have requested a couple of changes in the patch series. Can this
> > applied series still be changed, or are new patches required?
> 
> I have nothing sitting on top of it for now, so as far as I'm concerned
> we can apply a new series instead. Then we can also fold in that fix
> from Colin that he posted this morning...

The collision in memcontrol.c is a pain, but I guess as this is mainly
a loop patch, the block tree is an appropriate route.

Here's the collision between "mm: Charge active memcg when no mm is
set" and Shakeel's
https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com


--- mm/memcontrol.c
+++ mm/memcontrol.c
@@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 		rcu_read_unlock();
 	}
 
-	if (!memcg)
-		memcg = get_mem_cgroup_from_mm(mm);
+	if (!memcg) {
+		if (!mm) {
+			memcg = get_mem_cgroup_from_current();
+			if (!memcg)
+				memcg = get_mem_cgroup_from_mm(current->mm);
+		} else {
+			memcg = get_mem_cgroup_from_mm(mm);
+		}
+	}
 
 	ret = try_charge(memcg, gfp_mask, nr_pages);
 	if (ret)


Which I resolved thusly:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
	struct mem_cgroup *memcg;
	int ret;

	if (mem_cgroup_disabled())
		return 0;

	if (!mm) {
		memcg = get_mem_cgroup_from_current();
		if (!memcg)
			memcg = get_mem_cgroup_from_mm(current->mm);
	} else {
		memcg = get_mem_cgroup_from_mm(mm);
	}
				
	ret = __mem_cgroup_charge(page, memcg, gfp_mask);
	css_put(&memcg->css);

	return ret;
}

Shakeel Butt March 19, 2021, 12:56 a.m. UTC | #5
On Thu, Mar 18, 2021 at 4:46 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe <axboe@kernel.dk> wrote:
>
> > On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <axboe@kernel.dk> wrote:
> > >>
> > >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > >>> No major changes, just rebasing and resubmitting
> > >>
> > >> Applied for 5.13, thanks.
> > >>
> > >
> > > I have requested a couple of changes in the patch series. Can this
> > > applied series still be changed, or are new patches required?
> >
> > I have nothing sitting on top of it for now, so as far as I'm concerned
> > we can apply a new series instead. Then we can also fold in that fix
> > from Colin that he posted this morning...
>
> The collision in memcontrol.c is a pain, but I guess as this is mainly
> a loop patch, the block tree is an appropriate route.
>
> Here's the collision between "mm: Charge active memcg when no mm is
> set" and Shakeel's
> https://lkml.kernel.org/r/20210305212639.775498-1-shakeelb@google.com
>
>
> --- mm/memcontrol.c
> +++ mm/memcontrol.c
> @@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>                 rcu_read_unlock();
>         }
>
> -       if (!memcg)
> -               memcg = get_mem_cgroup_from_mm(mm);
> +       if (!memcg) {
> +               if (!mm) {
> +                       memcg = get_mem_cgroup_from_current();
> +                       if (!memcg)
> +                               memcg = get_mem_cgroup_from_mm(current->mm);
> +               } else {
> +                       memcg = get_mem_cgroup_from_mm(mm);
> +               }
> +       }
>
>         ret = try_charge(memcg, gfp_mask, nr_pages);
>         if (ret)
>
>
> Which I resolved thusly:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
>         struct mem_cgroup *memcg;
>         int ret;
>
>         if (mem_cgroup_disabled())
>                 return 0;
>
>         if (!mm) {
>                 memcg = get_mem_cgroup_from_current();
>                 if (!memcg)
>                         memcg = get_mem_cgroup_from_mm(current->mm);
>         } else {
>                 memcg = get_mem_cgroup_from_mm(mm);
>         }
>
>         ret = __mem_cgroup_charge(page, memcg, gfp_mask);
>         css_put(&memcg->css);
>
>         return ret;
> }
>

We need something similar for mem_cgroup_swapin_charge_page() as well.

It is better to take this series in mm tree and Jens is ok with that [1].

[1] https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d2c4@kernel.dk/

Dan Schatzberg March 19, 2021, 3:51 p.m. UTC | #6
On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
> 
> We need something similar for mem_cgroup_swapin_charge_page() as well.
> 
> It is better to take this series in mm tree and Jens is ok with that [1].
> 
> [1] https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d2c4@kernel.dk/

It sounds like there are no concerns about the loop-related work in
the patch series. I'll rebase on the mm tree and resubmit.

Shakeel Butt March 19, 2021, 4:20 p.m. UTC | #7
On Fri, Mar 19, 2021 at 8:51 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
> >
> > We need something similar for mem_cgroup_swapin_charge_page() as well.
> >
> > It is better to take this series in mm tree and Jens is ok with that [1].
> >
> > [1] https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d2c4@kernel.dk/
>
> It sounds like there are no concerns about the loop-related work in
> the patch series. I'll rebase on the mm tree and resubmit.

One suggestion would be to make get_mem_cgroup_from_mm() more generic
(i.e. handle !mm && active_memcg() case) and avoid
get_mem_cgroup_from_current() as it might go away.

Dan Schatzberg March 19, 2021, 4:27 p.m. UTC | #8
On Fri, Mar 19, 2021 at 09:20:16AM -0700, Shakeel Butt wrote:
> One suggestion would be to make get_mem_cgroup_from_mm() more generic
> (i.e. handle !mm && active_memcg() case) and avoid
> get_mem_cgroup_from_current() as it might go away.

Yeah, that occurred to me as well. I'll take a stab at doing that.
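
For illustration, the lookup order being discussed (explicit mm first,
otherwise the active memcg, otherwise current->mm) could be modelled in
userspace roughly as below. This is a sketch under assumptions, not
kernel code: struct task, model_get_memcg() and root_memcg are
hypothetical stand-ins, reference counting (css_get/css_put) and RCU
are left out entirely, and the final fallback to the root cgroup is
assumed for the model.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; not the real definitions. */
struct mem_cgroup {
	const char *name;
};

struct mm_struct {
	struct mem_cgroup *memcg;	/* cgroup owning this address space */
};

struct task {
	struct mm_struct *mm;			/* NULL for kernel threads */
	struct mem_cgroup *active_memcg;	/* NULL unless one was set */
};

/* Assumed fallback when nothing else identifies a cgroup. */
static struct mem_cgroup root_memcg = { "root" };

/*
 * Hypothetical generic lookup: an explicit mm wins. Without one,
 * prefer the task's active memcg, then the task's own mm, and finally
 * the root cgroup.
 */
static struct mem_cgroup *model_get_memcg(const struct task *cur,
					  const struct mm_struct *mm)
{
	assert(cur != NULL);

	if (!mm) {
		if (cur->active_memcg)
			return cur->active_memcg;
		mm = cur->mm;
	}

	if (mm && mm->memcg)
		return mm->memcg;

	return &root_memcg;
}
```

With this shape, a loop worker thread that has no mm but has an active
memcg set charges that memcg, while ordinary callers passing an mm see
no change in behavior.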