diff mbox series

[v1,1/5] mm/shmem: support deterministic charging of tmpfs

Message ID 20211108211959.1750915-2-almasrymina@google.com (mailing list archive)
State New
Headers show
Series [v1,1/5] mm/shmem: support deterministic charging of tmpfs | expand

Commit Message

Mina Almasry Nov. 8, 2021, 9:19 p.m. UTC
Add memcg= option to shmem mount.

Users can specify this option at mount time and all data page charges
will be charged to the memcg supplied.

Signed-off-by: Mina Almasry <almasrymina@google.com>

Cc: Michal Hocko <mhocko@suse.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Greg Thelen <gthelen@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Roman Gushchin <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: riel@surriel.com
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: cgroups@vger.kernel.org

---
 fs/super.c                 |   3 ++
 include/linux/fs.h         |   5 ++
 include/linux/memcontrol.h |  46 ++++++++++++++--
 mm/filemap.c               |   2 +-
 mm/memcontrol.c            | 104 ++++++++++++++++++++++++++++++++++++-
 mm/shmem.c                 |  50 +++++++++++++++++-
 6 files changed, 201 insertions(+), 9 deletions(-)

--
2.34.0.rc0.344.g81b53c2807-goog

Comments

Dave Chinner Nov. 8, 2021, 10:10 p.m. UTC | #1
On Mon, Nov 08, 2021 at 01:19:55PM -0800, Mina Almasry wrote:
> Add memcg= option to shmem mount.
> 
> Users can specify this option at mount time and all data page charges
> will be charged to the memcg supplied.
.....
> +/*
> + * Returns the memcg to charge for inode pages.  If non-NULL is returned, caller
> + * must drop reference with css_put().  NULL indicates that the inode does not
> + * have a memcg to charge, so the default process based policy should be used.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (!mapping)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);

Anything doing pointer chasing to obtain static, unchanging
superblock state is poorly implemented. The s_memcg_to_charge value never
changes, so this code should associate the memcg to charge directly
on the mapping when the mapping is first initialised by the
filesystem. We already do this with things like attaching address
space ops and mapping specific gfp masks (i.e
mapping_set_gfp_mask()), so this association should be set up that
way, too (e.g. mapping_set_memcg_to_charge()).

And because this memcg pointer is static and unchanging for the entire
life of the superblock, the superblock *must* pin the memcg into
memory and that means we can elide the rcu locking altogether in the
fast path for all filesystems that don't support this functionality.
i.e. we can simply do:

	if (!mapping || !mapping->memcg_to_charge)
		return NULL;

And then only if there is a memcg to charge, we do the slow path
locking and lookup stuff...

Cheers,

Dave.
Matthew Wilcox Nov. 8, 2021, 11:41 p.m. UTC | #2
On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > +	rcu_read_lock();
> > +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> 
> Anything doing pointer chasing to obtain static, unchanging
> superblock state is poorly implemented. The s_memcg_to_charge value never
> changes, so this code should associate the memcg to charge directly
> on the mapping when the mapping is first initialised by the
> filesystem. We already do this with things like attaching address
> space ops and mapping specific gfp masks (i.e
> mapping_set_gfp_mask()), so this association should be set up that
> way, too (e.g. mapping_set_memcg_to_charge()).

I'm not a fan of enlarging struct address_space with another pointer
unless it's going to be used by all/most filesystems.  If this is
destined to be a shmem-only feature, then it should be in the
shmem_inode instead of the mapping.

If we are to have this for all filesystems, then let's do that properly
and make it generic functionality from its introduction.
Roman Gushchin Nov. 9, 2021, 1:15 a.m. UTC | #3
On Mon, Nov 08, 2021 at 01:19:55PM -0800, Mina Almasry wrote:
> Add memcg= option to shmem mount.
> 
> Users can specify this option at mount time and all data page charges
> will be charged to the memcg supplied.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Hello, Mina!

Overall I think it's a useful option and I do see it being used outside
of the tmpfs use case. In certain cases a user might want to use memory
cgroups to limit the amount of pagecache, and something like what you've
suggested might be useful. I agree with Michal that it opens some
hard questions about certain corner cases, but I don't think there are
any show stoppers (at least as of now).

> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Roman Gushchin <songmuchun@bytedance.com>

This is clearly not my email.

> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: riel@surriel.com
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: cgroups@vger.kernel.org
> 
> ---
>  fs/super.c                 |   3 ++
>  include/linux/fs.h         |   5 ++
>  include/linux/memcontrol.h |  46 ++++++++++++++--
>  mm/filemap.c               |   2 +-
>  mm/memcontrol.c            | 104 ++++++++++++++++++++++++++++++++++++-
>  mm/shmem.c                 |  50 +++++++++++++++++-
>  6 files changed, 201 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 3bfc0f8fbd5bc..8aafe5e4e6200 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -24,6 +24,7 @@
>  #include <linux/export.h>
>  #include <linux/slab.h>
>  #include <linux/blkdev.h>
> +#include <linux/memcontrol.h>
>  #include <linux/mount.h>
>  #include <linux/security.h>
>  #include <linux/writeback.h>		/* for the emergency remount stuff */
> @@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s)
>  	up_write(&s->s_umount);
>  	list_lru_destroy(&s->s_dentry_lru);
>  	list_lru_destroy(&s->s_inode_lru);
> +	mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL);
>  	security_sb_free(s);
>  	put_user_ns(s->s_user_ns);
>  	kfree(s->s_subtype);
> @@ -292,6 +294,7 @@ static void __put_super(struct super_block *s)
>  		WARN_ON(s->s_dentry_lru.node);
>  		WARN_ON(s->s_inode_lru.node);
>  		WARN_ON(!list_empty(&s->s_mounts));
> +		mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL);
>  		security_sb_free(s);
>  		fscrypt_sb_free(s);
>  		put_user_ns(s->s_user_ns);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3afca821df32e..59407b3e7aee3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1567,6 +1567,11 @@ struct super_block {
>  	struct workqueue_struct *s_dio_done_wq;
>  	struct hlist_head s_pins;
> 
> +#ifdef CONFIG_MEMCG
> +	/* memcg to charge for pages allocated to this filesystem */
> +	struct mem_cgroup *s_memcg_to_charge;
> +#endif
> +
>  	/*
>  	 * Owning user namespace and default context in which to
>  	 * interpret filesystem uids, gids, quotas, device nodes,
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c5c403f4be6b..e9a64c1b8295b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -27,6 +27,7 @@ struct obj_cgroup;
>  struct page;
>  struct mm_struct;
>  struct kmem_cache;
> +struct super_block;
> 
>  /* Cgroup-specific page state, on top of universal node page state */
>  enum memcg_stat_item {
> @@ -689,7 +690,8 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
>  		page_counter_read(&memcg->memory);
>  }
> 
> -int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
> +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp,
> +			struct address_space *mapping);
> 
>  /**
>   * mem_cgroup_charge - Charge a newly allocated folio to a cgroup.
> @@ -710,7 +712,7 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
>  {
>  	if (mem_cgroup_disabled())
>  		return 0;
> -	return __mem_cgroup_charge(folio, mm, gfp);
> +	return __mem_cgroup_charge(folio, mm, gfp, NULL);
>  }
> 
>  int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
> @@ -923,6 +925,16 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
>  	return !!(memcg->css.flags & CSS_ONLINE);
>  }
> 
> +bool is_remote_oom(struct mem_cgroup *memcg_under_oom);
> +void mem_cgroup_set_charge_target(struct mem_cgroup **target,
> +				  struct mem_cgroup *memcg);
> +struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
> +/**
> + * User is responsible for providing a buffer @buf of length @len and freeing
> + * it.
> + */
> +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len);
> +
>  void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
>  		int zid, int nr_pages);
> 
> @@ -1217,8 +1229,15 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
>  	return false;
>  }
> 
> -static inline int mem_cgroup_charge(struct folio *folio,
> -		struct mm_struct *mm, gfp_t gfp)
> +static inline int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> +				      gfp_t gfp_mask,
> +				      struct address_space *mapping)
> +{
> +	return 0;
> +}
> +
> +static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
> +				    gfp_t gfp_mask)
>  {
>  	return 0;
>  }
> @@ -1326,6 +1345,25 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
>  {
>  }
> 
> +static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
> +{
> +	return false;
> +}
> +
> +static inline void mem_cgroup_set_charge_target(struct mem_cgroup **target,
> +						struct mem_cgroup *memcg)
> +{
> +}
> +
> +static inline int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf,
> +					      size_t len)
> +{
> +	if (len < 1)
> +		return -EINVAL;
> +	buf[0] = '\0';
> +	return 0;
> +}
> +
>  static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
>  		int (*fn)(struct task_struct *, void *), void *arg)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6844c9816a864..75e81dfd2c580 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  	folio->index = index;
> 
>  	if (!huge) {
> -		error = mem_cgroup_charge(folio, NULL, gfp);
> +		error = __mem_cgroup_charge(folio, NULL, gfp, mapping);
>  		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
>  		if (error)
>  			goto error;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 781605e920153..389d2f2be9674 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2580,6 +2580,103 @@ void mem_cgroup_handle_over_high(void)
>  	css_put(&memcg->css);
>  }
> 
> +/*
> + * Non error return value must eventually be released with css_put().
> + */
> +struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
> +{
> +	struct file *file;
> +	struct cgroup_subsys_state *css;
> +	struct mem_cgroup *memcg;
> +
> +	file = filp_open(path, O_DIRECTORY | O_RDONLY, 0);
> +	if (IS_ERR(file))
> +		return (struct mem_cgroup *)file;
> +
> +	css = css_tryget_online_from_dir(file->f_path.dentry,
> +					 &memory_cgrp_subsys);
> +	if (IS_ERR(css))
> +		memcg = (struct mem_cgroup *)css;
> +	else
> +		memcg = container_of(css, struct mem_cgroup, css);
> +
> +	fput(file);
> +	return memcg;
> +}
> +
> +/*
> + * Get the name of the optional charge target memcg associated with @sb.  This
> + * is the cgroup name, not the cgroup path.
> + */
> +int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len)
> +{
> +	struct mem_cgroup *memcg;
> +	int ret = 0;
> +
> +	buf[0] = '\0';
> +
> +	rcu_read_lock();
> +	memcg = rcu_dereference(sb->s_memcg_to_charge);
> +	if (memcg && !css_tryget_online(&memcg->css))
> +		memcg = NULL;
> +	rcu_read_unlock();
> +
> +	if (!memcg)
> +		return 0;
> +
> +	ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
> +	if (ret >= len / 2)
> +		strcpy(buf, "?");
> +	else {
> +		char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
> +
> +		if (p)
> +			*p = '\0';
> +		else
> +			strcpy(buf, "?");
> +	}
> +
> +	css_put(&memcg->css);
> +	return ret < 0 ? ret : 0;
> +}
> +
> +/*
> + * Set or clear (if @memcg is NULL) charge association from file system to
> + * memcg.  If @memcg != NULL, then a css reference must be held by the caller to
> + * ensure that the cgroup is not deleted during this operation.
> + */
> +void mem_cgroup_set_charge_target(struct mem_cgroup **target,
> +				  struct mem_cgroup *memcg)
> +{
> +	if (memcg)
> +		css_get(&memcg->css);
> +	memcg = xchg(target, memcg);
> +	if (memcg)
> +		css_put(&memcg->css);
> +}
> +
> +/*
> + * Returns the memcg to charge for inode pages.  If non-NULL is returned, caller
> + * must drop reference with css_put().  NULL indicates that the inode does not
> + * have a memcg to charge, so the default process based policy should be used.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (!mapping)
> +		return NULL;
> +
> +	rcu_read_lock();
> +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> +	if (memcg && !css_tryget_online(&memcg->css))
> +		memcg = NULL;
> +	rcu_read_unlock();
> +
> +	return memcg;
> +}
> +
>  static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  			unsigned int nr_pages)
>  {
> @@ -6678,12 +6775,15 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
>  	return ret;
>  }
> 
> -int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
> +int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp,
> +			struct address_space *mapping)
>  {
>  	struct mem_cgroup *memcg;
>  	int ret;
> 
> -	memcg = get_mem_cgroup_from_mm(mm);
> +	memcg = mem_cgroup_mapping_get_charge_target(mapping);
> +	if (!memcg)
> +		memcg = get_mem_cgroup_from_mm(mm);
>  	ret = charge_memcg(folio, memcg, gfp);
>  	css_put(&memcg->css);

This function becomes very non-obvious as it's taking a folio (ex-page),
mm and now mapping as arguments. I'd just add a new wrapper around
charge_memcg() instead.

I'd also merge the next patch into this one.

Thanks!
Dave Chinner Nov. 9, 2021, 1:18 a.m. UTC | #4
On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote:
> On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > > +	rcu_read_lock();
> > > +	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> > 
> > Anything doing pointer chasing to obtain static, unchanging
> > superblock state is poorly implemented. The s_memcg_to_charge value never
> > changes, so this code should associate the memcg to charge directly
> > on the mapping when the mapping is first initialised by the
> > filesystem. We already do this with things like attaching address
> > space ops and mapping specific gfp masks (i.e
> > mapping_set_gfp_mask()), so this association should be set up that
> > way, too (e.g. mapping_set_memcg_to_charge()).
> 
> I'm not a fan of enlarging struct address_space with another pointer
> unless it's going to be used by all/most filesystems.  If this is
> destined to be a shmem-only feature, then it should be in the
> shmem_inode instead of the mapping.

Neither am I, but I'm also not a fan of the filemap code still
having to drill through the mapping to the host inode just to check
if it needs to do special stuff for shmem inodes on every call that
adds a page to the page cache. This is just as messy and intrusive
and the memcg code really has no business digging about in the
filesystem specific details of the inode behind the mapping.

Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a
null mm context, so deep in the guts it ends up getting the memcg from
active_memcg() in get_mem_cgroup_from_mm(). That ends up using
current->active_memcg, so maybe a better approach here is to have
shmem override current->active_memcg via set_active_memcg() before
it enters the generic fs paths and restore it on return...

current_fsmemcg()?

> If we are to have this for all filesystems, then let's do that properly
> and make it generic functionality from its introduction.

Fully agree.

Cheers,

Dave.
Mina Almasry Nov. 9, 2021, 11:56 p.m. UTC | #5
On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote:
> > On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > > > + rcu_read_lock();
> > > > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> > >
> > > Anything doing pointer chasing to obtain static, unchanging
> > > superblock state is poorly implemented. The s_memcg_to_charge value never
> > > changes, so this code should associate the memcg to charge directly
> > > on the mapping when the mapping is first initialised by the
> > > filesystem. We already do this with things like attaching address
> > > space ops and mapping specific gfp masks (i.e
> > > mapping_set_gfp_mask()), so this association should be set up that
> > > way, too (e.g. mapping_set_memcg_to_charge()).
> >
> > I'm not a fan of enlarging struct address_space with another pointer
> > unless it's going to be used by all/most filesystems.  If this is
> > destined to be a shmem-only feature, then it should be in the
> > shmem_inode instead of the mapping.
>
> Neither am I, but I'm also not a fan of the filemap code still
> having to drill through the mapping to the host inode just to check
> if it needs to do special stuff for shmem inodes on every call that
> adds a page to the page cache. This is just as messy and intrusive
> and the memcg code really has no business digging about in the
> filesystem specific details of the inode behind the mapping.
>
> Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a
> null mm context, so deep in the guts it ends up getting the memcg from
> active_memcg() in get_mem_cgroup_from_mm(). That ends up using
> current->active_memcg, so maybe a better approach here is to have
> shmem override current->active_memcg via set_active_memcg() before
> it enters the generic fs paths and restore it on return...
>
> current_fsmemcg()?
>

Thank you for providing a detailed alternative. To be honest it seems
a bit brittle to me: folks can easily add calls to generic fs
paths while forgetting to override active_memcg, leaving memory
charged incorrectly. But if there is no other option and we want to
make this a shmem-only feature, I can do this anyway.

> > If we are to have this for all filesystems, then let's do that properly
> > and make it generic functionality from its introduction.
>
> Fully agree.

So the tmpfs feature addresses the first two use cases I mention in the
cover letter. For the third use case I will likely need to extend this
support to one disk-based filesystem, and I'm not sure which at this
point. It also looks like Roman has one or more use cases in mind and
may extend it to other filesystems as a result. I'm hoping that I can
provide the generic implementation and the tmpfs support, and in follow-up
patches folks can extend this to other file systems by providing

AFAICT with this patch the work to extend to another file system is to
parse the memcg= option in that filesystem, set the s_memcg_to_charge
on the superblock (or mapping) of that filesystem, and to charge
s_memcg_to_charge in fs specific code paths, so all are fs-specific
changes.

Based on this, it seems to me the suggestion is to hang the
memcg_to_charge off the mapping? I'm not sure if *most/all*
filesystems will eventually support it, but likely more than just
tmpfs.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Mina Almasry Nov. 10, 2021, 1:15 a.m. UTC | #6
On Tue, Nov 9, 2021 at 3:56 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Nov 08, 2021 at 11:41:51PM +0000, Matthew Wilcox wrote:
> > > On Tue, Nov 09, 2021 at 09:10:47AM +1100, Dave Chinner wrote:
> > > > > + rcu_read_lock();
> > > > > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
> > > >
> > > > Anything doing pointer chasing to obtain static, unchanging
> > > > superblock state is poorly implemented. The s_memcg_to_charge value never
> > > > changes, so this code should associate the memcg to charge directly
> > > > on the mapping when the mapping is first initialised by the
> > > > filesystem. We already do this with things like attaching address
> > > > space ops and mapping specific gfp masks (i.e
> > > > mapping_set_gfp_mask()), so this association should be set up that
> > > > way, too (e.g. mapping_set_memcg_to_charge()).
> > >
> > > I'm not a fan of enlarging struct address_space with another pointer
> > > unless it's going to be used by all/most filesystems.  If this is
> > > destined to be a shmem-only feature, then it should be in the
> > > shmem_inode instead of the mapping.
> >
> > Neither am I, but I'm also not a fan of the filemap code still
> > having to drill through the mapping to the host inode just to check
> > if it needs to do special stuff for shmem inodes on every call that
> > adds a page to the page cache. This is just as messy and intrusive
> > and the memcg code really has no business digging about in the
> > filesystem specific details of the inode behind the mapping.
> >
> > Hmmm. The mem_cgroup_charge() call in filemap_add_folio() passes a
> > null mm context, so deep in the guts it ends up getting the memcg from
> > active_memcg() in get_mem_cgroup_from_mm(). That ends up using
> > current->active_memcg, so maybe a better approach here is to have
> > shmem override current->active_memcg via set_active_memcg() before
> > it enters the generic fs paths and restore it on return...
> >
> > current_fsmemcg()?
> >
>
> Thank you for providing a detailed alternative. To be honest it seems
> a bit brittle to me, as in folks can easily add calls to generic fs
> paths forgetting to override the active_memcg and having memory
> charged incorrectly, but if there is no other option and we want to
> make this a shmem-only feature, I can do this anyway.
>
> > > If we are to have this for all filesystems, then let's do that properly
> > > and make it generic functionality from its introduction.
> >
> > Fully agree.
>
> So the tmpfs feature addresses the first 2 usecases I mention in the
> cover letter. For the 3rd usecase I will likely need to extend this
> support to 1 disk-based filesystem, and I'm not sure which at this
> point. It also looks like Roman has in mind 1 or more use cases and
> may extend it to other filesystems as a result. I'm hoping that I can
> provide the generic implementation and the tmpfs support and in follow
> up patches folks can extend this to other file systems by providing
> the fs-specific changes needed for that filesystem.
>
> AFAICT with this patch the work to extend to another file system is to
> parse the memcg= option in that filesystem, set the s_memcg_to_charge
> on the superblock (or mapping) of that filesystem, and to charge
> s_memcg_to_charge in fs specific code paths, so all are fs-specific
> changes.
>
> Based on this, it seems to me the suggestion is to hang the
> memcg_to_charge off the mapping? I'm not sure if *most/all*
> filesystems will eventually support it, but likely more than just
> tmpfs.
>

Greg actually points out to me off list that the patches, as written,
support remounting the tmpfs with a different memcg= option, where
future charges will be directed to the new memcg provided by the
option, so s_memcg_to_charge is more of a property of the super_block.

We could explicitly disable remounting with a different memcg=, but
I'm hoping to preserve that support if possible. The only way I can see
to preserve it while avoiding the pointer chasing in generic fs paths
is for shmem to call set_active_memcg() before calling into the generic
fs code, but every other fs that adopts this feature would need to do
the same. I'm not sure that's the better option. Let me know what you
think please. Thanks!

>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
Shakeel Butt Nov. 15, 2021, 5:53 p.m. UTC | #7
On Mon, Nov 8, 2021 at 5:18 PM Dave Chinner <david@fromorbit.com> wrote:
>
[...]
>
> > If we are to have this for all filesystems, then let's do that properly
> > and make it generic functionality from its introduction.
>
> Fully agree.
>

Mina, I think supporting all filesystems might be a much cleaner
solution than adding fs specific code.

We need to:

1) Add memcg option handling in vfs_parse_fs_param() before fs
specific param handling.
2) Add a new page cache memcg charging interface (similar to swap).

With (1), there is no need to change any fs-specific code.

With (2), fs code paths will be free of memcg-specific handling. This
new interface will be used in __filemap_add_folio(),
shmem_add_to_page_cache() and collapse_file().

thanks,
Shakeel
diff mbox series

Patch

diff --git a/fs/super.c b/fs/super.c
index 3bfc0f8fbd5bc..8aafe5e4e6200 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -24,6 +24,7 @@ 
 #include <linux/export.h>
 #include <linux/slab.h>
 #include <linux/blkdev.h>
+#include <linux/memcontrol.h>
 #include <linux/mount.h>
 #include <linux/security.h>
 #include <linux/writeback.h>		/* for the emergency remount stuff */
@@ -180,6 +181,7 @@  static void destroy_unused_super(struct super_block *s)
 	up_write(&s->s_umount);
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL);
 	security_sb_free(s);
 	put_user_ns(s->s_user_ns);
 	kfree(s->s_subtype);
@@ -292,6 +294,7 @@  static void __put_super(struct super_block *s)
 		WARN_ON(s->s_dentry_lru.node);
 		WARN_ON(s->s_inode_lru.node);
 		WARN_ON(!list_empty(&s->s_mounts));
+		mem_cgroup_set_charge_target(&s->s_memcg_to_charge, NULL);
 		security_sb_free(s);
 		fscrypt_sb_free(s);
 		put_user_ns(s->s_user_ns);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3afca821df32e..59407b3e7aee3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1567,6 +1567,11 @@  struct super_block {
 	struct workqueue_struct *s_dio_done_wq;
 	struct hlist_head s_pins;

+#ifdef CONFIG_MEMCG
+	/* memcg to charge for pages allocated to this filesystem */
+	struct mem_cgroup *s_memcg_to_charge;
+#endif
+
 	/*
 	 * Owning user namespace and default context in which to
 	 * interpret filesystem uids, gids, quotas, device nodes,
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c5c403f4be6b..e9a64c1b8295b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,6 +27,7 @@  struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct super_block;

 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -689,7 +690,8 @@  static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 		page_counter_read(&memcg->memory);
 }

-int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
+int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp,
+			struct address_space *mapping);

 /**
  * mem_cgroup_charge - Charge a newly allocated folio to a cgroup.
@@ -710,7 +712,7 @@  static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 {
 	if (mem_cgroup_disabled())
 		return 0;
-	return __mem_cgroup_charge(folio, mm, gfp);
+	return __mem_cgroup_charge(folio, mm, gfp, NULL);
 }

 int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
@@ -923,6 +925,16 @@  static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
 	return !!(memcg->css.flags & CSS_ONLINE);
 }

+bool is_remote_oom(struct mem_cgroup *memcg_under_oom);
+void mem_cgroup_set_charge_target(struct mem_cgroup **target,
+				  struct mem_cgroup *memcg);
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path);
+/**
+ * User is responsible for providing a buffer @buf of length @len and freeing
+ * it.
+ */
+int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len);
+
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int zid, int nr_pages);

@@ -1217,8 +1229,15 @@  static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 	return false;
 }

-static inline int mem_cgroup_charge(struct folio *folio,
-		struct mm_struct *mm, gfp_t gfp)
+static inline int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
+				      gfp_t gfp_mask,
+				      struct address_space *mapping)
+{
+	return 0;
+}
+
+static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
+				    gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -1326,6 +1345,25 @@  static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
 {
 }

+static inline bool is_remote_oom(struct mem_cgroup *memcg_under_oom)
+{
+	return false;
+}
+
+static inline void mem_cgroup_set_charge_target(struct mem_cgroup **target,
+						struct mem_cgroup *memcg)
+{
+}
+
+static inline int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf,
+					      size_t len)
+{
+	if (len < 1)
+		return -EINVAL;
+	buf[0] = '\0';
+	return 0;
+}
+
 static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 		int (*fn)(struct task_struct *, void *), void *arg)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index 6844c9816a864..75e81dfd2c580 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -903,7 +903,7 @@  noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->index = index;

 	if (!huge) {
-		error = mem_cgroup_charge(folio, NULL, gfp);
+		error = __mem_cgroup_charge(folio, NULL, gfp, mapping);
 		VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 		if (error)
 			goto error;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 781605e920153..389d2f2be9674 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2580,6 +2580,103 @@  void mem_cgroup_handle_over_high(void)
 	css_put(&memcg->css);
 }

+/*
+ * Non error return value must eventually be released with css_put().
+ */
+struct mem_cgroup *mem_cgroup_get_from_path(const char *path)
+{
+	struct file *file;
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+
+	file = filp_open(path, O_DIRECTORY | O_RDONLY, 0);
+	if (IS_ERR(file))
+		return (struct mem_cgroup *)file;
+
+	css = css_tryget_online_from_dir(file->f_path.dentry,
+					 &memory_cgrp_subsys);
+	if (IS_ERR(css))
+		memcg = (struct mem_cgroup *)css;
+	else
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	fput(file);
+	return memcg;
+}
+
+/*
+ * Get the name of the optional charge target memcg associated with @sb.  This
+ * is the cgroup name, not the cgroup path.
+ */
+int mem_cgroup_get_name_from_sb(struct super_block *sb, char *buf, size_t len)
+{
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	buf[0] = '\0';
+
+	rcu_read_lock();
+	memcg = rcu_dereference(sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	if (!memcg)
+		return 0;
+
+	ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
+	if (ret >= len / 2)
+		strcpy(buf, "?");
+	else {
+		char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
+
+		if (p)
+			*p = '\0';
+		else
+			strcpy(buf, "?");
+	}
+
+	css_put(&memcg->css);
+	return ret < 0 ? ret : 0;
+}
+
+/*
+ * Set or clear (if @memcg is NULL) the charge association from a file system
+ * to a memcg.  If @memcg != NULL, the caller must hold a css reference to
+ * ensure that the cgroup is not deleted during this operation.
+ */
+void mem_cgroup_set_charge_target(struct mem_cgroup **target,
+				  struct mem_cgroup *memcg)
+{
+	if (memcg)
+		css_get(&memcg->css);
+	memcg = xchg(target, memcg);
+	if (memcg)
+		css_put(&memcg->css);
+}
+
+/*
+ * Returns the memcg to charge for inode pages.  If non-NULL is returned, caller
+ * must drop reference with css_put().  NULL indicates that the inode does not
+ * have a memcg to charge, so the default process based policy should be used.
+ */
+static struct mem_cgroup *
+mem_cgroup_mapping_get_charge_target(struct address_space *mapping)
+{
+	struct mem_cgroup *memcg;
+
+	if (!mapping)
+		return NULL;
+
+	rcu_read_lock();
+	memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge);
+	if (memcg && !css_tryget_online(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	return memcg;
+}
+
 static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			unsigned int nr_pages)
 {
@@ -6678,12 +6775,15 @@  static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	return ret;
 }

-int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp)
+int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp,
+			struct address_space *mapping)
 {
 	struct mem_cgroup *memcg;
 	int ret;

-	memcg = get_mem_cgroup_from_mm(mm);
+	memcg = mem_cgroup_mapping_get_charge_target(mapping);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(mm);
 	ret = charge_memcg(folio, memcg, gfp);
 	css_put(&memcg->css);

diff --git a/mm/shmem.c b/mm/shmem.c
index 23c91a8beb781..01510fa8ab725 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,10 +115,14 @@  struct shmem_options {
 	bool full_inums;
 	int huge;
 	int seen;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg;
+#endif
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
+#define SHMEM_SEEN_MEMCG 16
 };

 #ifdef CONFIG_TMPFS
@@ -709,7 +713,8 @@  static int shmem_add_to_page_cache(struct page *page,
 	page->index = index;

 	if (!PageSwapCache(page)) {
-		error = mem_cgroup_charge(page_folio(page), charge_mm, gfp);
+		error = __mem_cgroup_charge(page_folio(page), charge_mm, gfp,
+					    mapping);
 		if (error) {
 			if (PageTransHuge(page)) {
 				count_vm_event(THP_FILE_FALLBACK);
@@ -3342,6 +3347,7 @@  static const struct export_operations shmem_export_ops = {
 enum shmem_param {
 	Opt_gid,
 	Opt_huge,
+	Opt_memcg,
 	Opt_mode,
 	Opt_mpol,
 	Opt_nr_blocks,
@@ -3363,6 +3369,7 @@  static const struct constant_table shmem_param_enums_huge[] = {
 const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_u32   ("gid",		Opt_gid),
 	fsparam_enum  ("huge",		Opt_huge,  shmem_param_enums_huge),
+	fsparam_string("memcg",		Opt_memcg),
 	fsparam_u32oct("mode",		Opt_mode),
 	fsparam_string("mpol",		Opt_mpol),
 	fsparam_string("nr_blocks",	Opt_nr_blocks),
@@ -3379,6 +3386,7 @@  static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	struct shmem_options *ctx = fc->fs_private;
 	struct fs_parse_result result;
 	unsigned long long size;
+	struct mem_cgroup *memcg;
 	char *rest;
 	int opt;

@@ -3412,6 +3420,17 @@  static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			goto bad_value;
 		ctx->seen |= SHMEM_SEEN_INODES;
 		break;
+#ifdef CONFIG_MEMCG
+	case Opt_memcg:
+		if (ctx->memcg)
+			css_put(&ctx->memcg->css);
+		memcg = mem_cgroup_get_from_path(param->string);
+		if (IS_ERR(memcg))
+			goto bad_value;
+		ctx->memcg = memcg;
+		ctx->seen |= SHMEM_SEEN_MEMCG;
+		break;
+#endif
 	case Opt_mode:
 		ctx->mode = result.uint_32 & 07777;
 		break;
@@ -3573,6 +3592,14 @@  static int shmem_reconfigure(struct fs_context *fc)
 	}
 	raw_spin_unlock(&sbinfo->stat_lock);
 	mpol_put(mpol);
+#ifdef CONFIG_MEMCG
+	if ((ctx->seen & SHMEM_SEEN_MEMCG) && ctx->memcg) {
+		mem_cgroup_set_charge_target(&fc->root->d_sb->s_memcg_to_charge,
+					     ctx->memcg);
+		css_put(&ctx->memcg->css);
+		ctx->memcg = NULL;
+	}
+#endif
 	return 0;
 out:
 	raw_spin_unlock(&sbinfo->stat_lock);
@@ -3582,6 +3609,11 @@  static int shmem_reconfigure(struct fs_context *fc)
 static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
+	int err;
+	char *buf = __getname();
+
+	if (!buf)
+		return -ENOMEM;

 	if (sbinfo->max_blocks != shmem_default_max_blocks())
 		seq_printf(seq, ",size=%luk",
@@ -3625,7 +3657,13 @@  static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 		seq_printf(seq, ",huge=%s", shmem_format_huge(sbinfo->huge));
 #endif
 	shmem_show_mpol(seq, sbinfo->mpol);
-	return 0;
+	/* Memory cgroup binding: memcg=<cgroup path> */
+	err = mem_cgroup_get_name_from_sb(root->d_sb, buf, PATH_MAX);
+	if (!err && buf[0] != '\0')
+		seq_printf(seq, ",memcg=%s", buf);
+
+	__putname(buf);
+	return err;
 }

 #endif /* CONFIG_TMPFS */
@@ -3710,6 +3748,14 @@  static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_flags |= SB_POSIXACL;
 #endif
 	uuid_gen(&sb->s_uuid);
+#ifdef CONFIG_MEMCG
+	if (ctx->memcg) {
+		mem_cgroup_set_charge_target(&sb->s_memcg_to_charge,
+					     ctx->memcg);
+		css_put(&ctx->memcg->css);
+		ctx->memcg = NULL;
+	}
+#endif

 	inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, VM_NORESERVE);
 	if (!inode)