
[v2] hugetlb: Add hugetlb.*.numa_stat file

Message ID 20211020190952.2658759-1-almasrymina@google.com (mailing list archive)
State New
Series [v2] hugetlb: Add hugetlb.*.numa_stat file

Commit Message

Mina Almasry Oct. 20, 2021, 7:09 p.m. UTC
For hugetlb-backed jobs/VMs it's critical to understand the numa
information for the memory backing these jobs to deliver optimal
performance.

Currently this technically can be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped.
2. numa_maps are per process and need to be aggregated across all
   processes in the cgroup. For shared memory this is more involved as
   the userspace needs to make sure it doesn't double count shared
   mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
   to the contention on this lock.

For these reasons I propose simply adding hugetlb.*.numa_stat file,
which shows the numa information of the cgroup similarly to
memory.numa_stat.

On cgroup-v2:
   cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0

On cgroup-v1:
   cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0
   hierarichal_total=2097152 N0=2097152 N1=0

This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
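
As a rough illustration of that manual test, a sketch of the sequence is
below. It is illustrative only: the 2MB huge page size, the cgroup-v2 path
and the assumption that the task is already in the test cgroup (and that
huge pages are reserved) are not part of the patch; the allocation mirrors
what tools/testing/selftests/vm/write_to_hugetlbfs.c does with MAP_HUGETLB.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Assumed: 2MB huge pages are available and this task is in the test cgroup. */
	const size_t len = 2 * 1024 * 1024;
	const char *stat_path = "/sys/fs/cgroup/test/hugetlb.2MB.numa_stat";
	char buf[256];
	ssize_t n;
	int fd;

	/* Allocate and touch one 2MB huge page so it gets charged to the cgroup. */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(p, 0, len);

	/* Read back the per-node usage reported by the new numa_stat file. */
	fd = open(stat_path, O_RDONLY);
	if (fd < 0) {
		perror("open numa_stat");
		return 1;
	}
	n = read(fd, buf, sizeof(buf) - 1);
	if (n > 0) {
		buf[n] = '\0';
		printf("%s", buf);	/* e.g. "total=2097152 N0=2097152 N1=0" */
	}
	close(fd);
	munmap(p, len);
	return 0;
}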

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jue Wang <juew@google.com>
Cc: Yang Yao <ygyao@google.com>
Cc: Joanna Li <joannali@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v2:
- Fix warning Reported-by: kernel test robot <lkp@intel.com>
---
 .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
 Documentation/admin-guide/cgroup-v2.rst       |  7 ++
 include/linux/hugetlb.h                       |  4 +-
 include/linux/hugetlb_cgroup.h                |  7 ++
 mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
 .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
 6 files changed, 113 insertions(+), 11 deletions(-)

--
2.33.0.1079.g6e70778dc9-goog

Comments

Mike Kravetz Oct. 27, 2021, 11:35 p.m. UTC | #1
On 10/20/21 12:09 PM, Mina Almasry wrote:
> For hugetlb backed jobs/VMs it's critical to understand the numa
> information for the memory backing these jobs to deliver optimal
> performance.
> 
> Currently this techinically can be queried from /proc/self/numa_maps, but

                 technically

> there are significant issues with that. Namely:
> 1. Memory can be mapped on unmapped.
> 2. numa_maps are per process and need to be aggregaged across all

                                              aggregated

>    proceses in the cgroup. For shared memory this is more involved as

     processes

>    the userspace needs to make sure it doesn't double count shared
>    mappings.
> 3. I believe querying numa_maps needs to hold the mmap_lock which adds
>    to the contention on this lock.
> 
> For these reasons I propose simply adding hugetlb.*.numa_stat file,
> which shows the numa information of the cgroup similarly to
> memory.numa_stat.
> 
> On cgroup-v2:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat

I know that cgroup can be mounted anywhere, but I think the conventional
locations are mentioned in the man page.  Why not use them?

v2 '/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat'

>    total=2097152 N0=2097152 N1=0
> 
> On cgroup-v1:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat

v1 '/sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat'

>    total=2097152 N0=2097152 N1=0
>    hierarichal_total=2097152 N0=2097152 N1=0
> 
> This patch was tested manually by allocating hugetlb memory and querying
> the hugetlb.*.numa_stat file of the cgroup and its parents.

I have no objections to adding this functionality, and do not see any
blocking issues in hugetlb code.  However, it would be GREAT if someone
more familiar/experienced with cgroups would comment.  My cgroup
experience is very limited.

> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jue Wang <juew@google.com>
> Cc: Yang Yao <ygyao@google.com>
> Cc: Joanna Li <joannali@google.com>
> Cc: Cannon Matthews <cannonmatthews@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> 
> ---
> 
> Changes in v2:
> - Fix warning Reported-by: kernel test robot <lkp@intel.com>
> ---
>  .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
>  Documentation/admin-guide/cgroup-v2.rst       |  7 ++
>  include/linux/hugetlb.h                       |  4 +-
>  include/linux/hugetlb_cgroup.h                |  7 ++
>  mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
>  .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
>  6 files changed, 113 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> index 338f2c7d7a1c..0fa724d82abb 100644
> --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> @@ -29,12 +29,14 @@ Brief summary of control files::
>   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
>   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
> 
>  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
>  files include::
> 
>    hugetlb.1GB.limit_in_bytes
>    hugetlb.1GB.max_usage_in_bytes
> +  hugetlb.1GB.numa_stat
>    hugetlb.1GB.usage_in_bytes
>    hugetlb.1GB.failcnt
>    hugetlb.1GB.rsvd.limit_in_bytes
> @@ -43,6 +45,7 @@ files include::
>    hugetlb.1GB.rsvd.failcnt
>    hugetlb.64KB.limit_in_bytes
>    hugetlb.64KB.max_usage_in_bytes
> +  hugetlb.64KB.numa_stat
>    hugetlb.64KB.usage_in_bytes
>    hugetlb.64KB.failcnt
>    hugetlb.64KB.rsvd.limit_in_bytes
> @@ -51,6 +54,7 @@ files include::
>    hugetlb.64KB.rsvd.failcnt
>    hugetlb.32MB.limit_in_bytes
>    hugetlb.32MB.max_usage_in_bytes
> +  hugetlb.32MB.numa_stat
>    hugetlb.32MB.usage_in_bytes
>    hugetlb.32MB.failcnt
>    hugetlb.32MB.rsvd.limit_in_bytes
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 4d8c27eca96b..8ba0d6aadd2c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2252,6 +2252,13 @@ HugeTLB Interface Files
>  	are local to the cgroup i.e. not hierarchical. The file modified event
>  	generated on this file reflects only the local events.
> 
> +  hugetlb.<hugepagesize>.numa_stat
> +	Similar to memory.numa_stat, it shows the numa information of the
> +	memory in this cgroup:

        hugetlb pages of <hugepagesize> in this cgroup.  Only active in 
	use hugetlb pages are included.  The per-node values are in bytes.

> +
> +	/dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
> +	total=0 N0=0 N1=0

I would not include the 'example' above as nothing like this is used
elsewhere in the file.

> +
>  Misc
>  ----
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..0445faaa636e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -613,8 +613,8 @@ struct hstate {
>  #endif
>  #ifdef CONFIG_CGROUP_HUGETLB
>  	/* cgroup control files */
> -	struct cftype cgroup_files_dfl[7];
> -	struct cftype cgroup_files_legacy[9];
> +	struct cftype cgroup_files_dfl[8];
> +	struct cftype cgroup_files_legacy[10];
>  #endif
>  	char name[HSTATE_NAME_LEN];
>  };
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index c137396129db..54ff6ec68ed3 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
>  	HUGETLB_NR_MEMORY_EVENTS,
>  };
> 
> +struct hugetlb_cgroup_per_node {
> +	/* hugetlb usage in bytes over all hstates. */
> +	unsigned long usage[HUGE_MAX_HSTATE];
> +};
> +
>  struct hugetlb_cgroup {
>  	struct cgroup_subsys_state css;
> 
> @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
> 
>  	/* Handle for "hugetlb.events.local" */
>  	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> +
> +	struct hugetlb_cgroup_per_node *nodeinfo[];
>  };
> 
>  static inline struct hugetlb_cgroup *
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 5383023d0cca..4b807292f7e8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>  				struct hugetlb_cgroup *parent_h_cgroup)
>  {
>  	int idx;
> +	int node;
> 
>  	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
>  		struct page_counter *fault_parent = NULL;
> @@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>  			limit);
>  		VM_BUG_ON(ret);
>  	}
> +
> +	for_each_node(node) {
> +		/* Set node_to_alloc to -1 for offline nodes. */
> +		int node_to_alloc =
> +			node_state(node, N_NORMAL_MEMORY) ? node : -1;
> +		h_cgroup->nodeinfo[node] =
> +			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> +				     GFP_KERNEL, node_to_alloc);

Do you need to handle kzalloc_node failure here?

alloc_mem_cgroup_per_node_info provides similar functionality and has
the following comment.

	 * TODO: this routine can waste much memory for nodes which will
	 *       never be onlined. It's better to use memory hotplug callback
	 *       function.

Extra credit if you do this here and in alloc_mem_cgroup_per_node_info. :)
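
For illustration, the failure handling could look roughly like the sketch
below. This is not the posted patch; the helper name and the
return-an-error convention are assumptions, but since kfree(NULL) is a
no-op the unwind is safe even for slots that were never allocated.

/*
 * Hypothetical unwinding variant of the nodeinfo allocation (not the
 * posted patch): return -ENOMEM on failure instead of leaving NULL
 * entries behind, and let the caller free the half-initialised css.
 */
static int hugetlb_cgroup_alloc_nodeinfo(struct hugetlb_cgroup *h_cgroup)
{
	int node;

	for_each_node(node) {
		/* Allocate from any node (NUMA_NO_NODE) when this node has no memory. */
		int node_to_alloc =
			node_state(node, N_NORMAL_MEMORY) ? node : NUMA_NO_NODE;

		h_cgroup->nodeinfo[node] =
			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
				     GFP_KERNEL, node_to_alloc);
		if (!h_cgroup->nodeinfo[node])
			goto fail;
	}
	return 0;

fail:
	/* kfree(NULL) is a no-op, so freeing every slot is safe. */
	for_each_node(node)
		kfree(h_cgroup->nodeinfo[node]);
	return -ENOMEM;
}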

> +	}
>  }
> 
>  static struct cgroup_subsys_state *
> @@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
>  	struct hugetlb_cgroup *h_cgroup;
> 
> -	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> +	unsigned int size =
> +		sizeof(*h_cgroup) +
> +		MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
> +	h_cgroup = kzalloc(size, GFP_KERNEL);
>  	if (!h_cgroup)
>  		return ERR_PTR(-ENOMEM);
> 
> @@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>  		return;
> 
>  	__set_hugetlb_cgroup(page, h_cg, rsvd);
> -	return;
> +	if (!rsvd && h_cg)
> +		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
> +								 << PAGE_SHIFT;
>  }
> 
>  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> @@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> 
>  	if (rsvd)
>  		css_put(&h_cg->css);
> -
> +	else
> +		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
> +								 << PAGE_SHIFT;
>  	return;
>  }
> 
> @@ -421,6 +438,56 @@ enum {
>  	RES_RSVD_FAILCNT,
>  };
> 
> +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> +{
> +	int nid;
> +	struct cftype *cft = seq_cft(seq);
> +	int idx = MEMFILE_IDX(cft->private);
> +	bool legacy = MEMFILE_ATTR(cft->private);
> +	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> +	struct cgroup_subsys_state *css;
> +	unsigned long usage;
> +
> +	if (legacy) {
> +		/* Add up usage across all nodes for the non-hierarchical total. */
> +		usage = 0;
> +		for_each_node_state(nid, N_MEMORY)
> +			usage += h_cg->nodeinfo[nid]->usage[idx];
> +		seq_printf(seq, "total=%lu", usage);
> +
> +		/* Simply print the per-node usage for the non-hierarchical total. */
> +		for_each_node_state(nid, N_MEMORY)
> +			seq_printf(seq, " N%d=%lu", nid,
> +				   h_cg->nodeinfo[nid]->usage[idx]);
> +		seq_putc(seq, '\n');
> +	}
> +
> +	/* The hierarchical total is pretty much the value recorded by the
> +	 * counter, so use that.

'pretty much' ... so use that.  Here is one place in particular where
input from someone with more cgroup experience would be appreciated.
Mina Almasry Oct. 31, 2021, 8:39 p.m. UTC | #2
On Wed, Oct 27, 2021 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 10/20/21 12:09 PM, Mina Almasry wrote:
> > For hugetlb backed jobs/VMs it's critical to understand the numa
> > information for the memory backing these jobs to deliver optimal
> > performance.
> >
> > Currently this techinically can be queried from /proc/self/numa_maps, but
>
>                  technically
>
> > there are significant issues with that. Namely:
> > 1. Memory can be mapped on unmapped.
> > 2. numa_maps are per process and need to be aggregaged across all
>
>                                               aggregated
>
> >    proceses in the cgroup. For shared memory this is more involved as
>
>      processes
>

Very sorry for the typos.

> >    the userspace needs to make sure it doesn't double count shared
> >    mappings.
> > 3. I believe querying numa_maps needs to hold the mmap_lock which adds
> >    to the contention on this lock.
> >
> > For these reasons I propose simply adding hugetlb.*.numa_stat file,
> > which shows the numa information of the cgroup similarly to
> > memory.numa_stat.
> >
> > On cgroup-v2:
> >    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
>
> I know that cgroup can be mounted anywhere, but I think the conventional
> locations are mentioned in the man page.  Why not use them?
>
> v2 '/sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat'
>
> >    total=2097152 N0=2097152 N1=0
> >
> > On cgroup-v1:
> >    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
>
> v1 '/sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat'
>
> >    total=2097152 N0=2097152 N1=0
> >    hierarichal_total=2097152 N0=2097152 N1=0
> >
> > This patch was tested manually by allocating hugetlb memory and querying
> > the hugetlb.*.numa_stat file of the cgroup and its parents.
>
> I have no objections to adding this functionality, and do not see any
> blocking issues in hugetlb code.  However, it would be GREAT if someone
> more familiar/experienced with cgroups would comment.  My cgroup
> experience is very limited.
>

I will send V2 to Shakeel from our team as well and ask him to take a
look; he has more than enough experience to review. If anyone else
reading along with cgroups knowledge can Review/Ack, that would be great.

It's possible I'm a bit off the mark here, but FWIW I don't think
there is much that is contentious or ambiguous here. For
memory.numa_stat it's a bit nuanced because there is anon, rss, shmem,
etc., but for hugetlb we just have hugetlb memory and the only care
needed is hierarchical vs. non-hierarchical in cgroup v1.

> > 
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Miaohe Lin <linmiaohe@huawei.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Muchun Song <songmuchun@bytedance.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Jue Wang <juew@google.com>
> > Cc: Yang Yao <ygyao@google.com>
> > Cc: Joanna Li <joannali@google.com>
> > Cc: Cannon Matthews <cannonmatthews@google.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >
> > ---
> >
> > Changes in v2:
> > - Fix warning Reported-by: kernel test robot <lkp@intel.com>
> > ---
> >  .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
> >  Documentation/admin-guide/cgroup-v2.rst       |  7 ++
> >  include/linux/hugetlb.h                       |  4 +-
> >  include/linux/hugetlb_cgroup.h                |  7 ++
> >  mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
> >  .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
> >  6 files changed, 113 insertions(+), 11 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > index 338f2c7d7a1c..0fa724d82abb 100644
> > --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > @@ -29,12 +29,14 @@ Brief summary of control files::
> >   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
> >   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
> >   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> > + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
> >
> >  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
> >  files include::
> >
> >    hugetlb.1GB.limit_in_bytes
> >    hugetlb.1GB.max_usage_in_bytes
> > +  hugetlb.1GB.numa_stat
> >    hugetlb.1GB.usage_in_bytes
> >    hugetlb.1GB.failcnt
> >    hugetlb.1GB.rsvd.limit_in_bytes
> > @@ -43,6 +45,7 @@ files include::
> >    hugetlb.1GB.rsvd.failcnt
> >    hugetlb.64KB.limit_in_bytes
> >    hugetlb.64KB.max_usage_in_bytes
> > +  hugetlb.64KB.numa_stat
> >    hugetlb.64KB.usage_in_bytes
> >    hugetlb.64KB.failcnt
> >    hugetlb.64KB.rsvd.limit_in_bytes
> > @@ -51,6 +54,7 @@ files include::
> >    hugetlb.64KB.rsvd.failcnt
> >    hugetlb.32MB.limit_in_bytes
> >    hugetlb.32MB.max_usage_in_bytes
> > +  hugetlb.32MB.numa_stat
> >    hugetlb.32MB.usage_in_bytes
> >    hugetlb.32MB.failcnt
> >    hugetlb.32MB.rsvd.limit_in_bytes
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 4d8c27eca96b..8ba0d6aadd2c 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2252,6 +2252,13 @@ HugeTLB Interface Files
> >       are local to the cgroup i.e. not hierarchical. The file modified event
> >       generated on this file reflects only the local events.
> >
> > +  hugetlb.<hugepagesize>.numa_stat
> > +     Similar to memory.numa_stat, it shows the numa information of the
> > +     memory in this cgroup:
>
>         hugetlb pages of <hugepagesize> in this cgroup.  Only active in
>         use hugetlb pages are included.  The per-node values are in bytes.
>
> > +
> > +     /dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
> > +     total=0 N0=0 N1=0
>
> I would not include the 'example' above as nothing like this is used
> elsewhere in the file.
>
> > +
> >  Misc
> >  ----
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 1faebe1cd0ed..0445faaa636e 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -613,8 +613,8 @@ struct hstate {
> >  #endif
> >  #ifdef CONFIG_CGROUP_HUGETLB
> >       /* cgroup control files */
> > -     struct cftype cgroup_files_dfl[7];
> > -     struct cftype cgroup_files_legacy[9];
> > +     struct cftype cgroup_files_dfl[8];
> > +     struct cftype cgroup_files_legacy[10];
> >  #endif
> >       char name[HSTATE_NAME_LEN];
> >  };
> > diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> > index c137396129db..54ff6ec68ed3 100644
> > --- a/include/linux/hugetlb_cgroup.h
> > +++ b/include/linux/hugetlb_cgroup.h
> > @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
> >       HUGETLB_NR_MEMORY_EVENTS,
> >  };
> >
> > +struct hugetlb_cgroup_per_node {
> > +     /* hugetlb usage in bytes over all hstates. */
> > +     unsigned long usage[HUGE_MAX_HSTATE];
> > +};
> > +
> >  struct hugetlb_cgroup {
> >       struct cgroup_subsys_state css;
> >
> > @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
> >
> >       /* Handle for "hugetlb.events.local" */
> >       struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> > +
> > +     struct hugetlb_cgroup_per_node *nodeinfo[];
> >  };
> >
> >  static inline struct hugetlb_cgroup *
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 5383023d0cca..4b807292f7e8 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> >                               struct hugetlb_cgroup *parent_h_cgroup)
> >  {
> >       int idx;
> > +     int node;
> >
> >       for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
> >               struct page_counter *fault_parent = NULL;
> > @@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> >                       limit);
> >               VM_BUG_ON(ret);
> >       }
> > +
> > +     for_each_node(node) {
> > +             /* Set node_to_alloc to -1 for offline nodes. */
> > +             int node_to_alloc =
> > +                     node_state(node, N_NORMAL_MEMORY) ? node : -1;
> > +             h_cgroup->nodeinfo[node] =
> > +                     kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> > +                                  GFP_KERNEL, node_to_alloc);
>
> Do you need to handle kzalloc_node failure here?
>

Will do.

> alloc_mem_cgroup_per_node_info provides similar functionality and has
> the following comment.
>
>          * TODO: this routine can waste much memory for nodes which will
>          *       never be onlined. It's better to use memory hotplug callback
>          *       function.
>

So the memory allocated per node in total is (HUGE_MAX_HSTATE *
sizeof(unsigned long) * num_css_on_the_system). HUGE_MAX_HSTATE is 2 on x86.
This is significant, but to address this comment I have to complicate
the hugetlb_cgroup code quite a bit so I thought I'd check with you if
you think it's still OK/worth it. slub.c implements these changes
(slab_memory_callback) and they are:

- When creating a new hugetlb_cgroup, I create per_node data for only
online nodes.
- On node online I need to loop over all existing hugetlb_cgroups and
allocate per_node data. I need rcu_read_lock() here and below.
- On node offline I need to loop over all existing hugetlb_cgroups and
free per_node data.
- If I follow the slub example, I need a lock to synchronize
onlining/offlining with references to per_node data in commit_charge(),
uncharge_page() and show_numa_stats().

I don't think it's worth it TBH, but I'm happy to oblige if you think so.
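
(For context, the shape of the slub-style callback described above would be
roughly the sketch below. It is illustrative only and not part of this
patch; hugetlb_cgroup_node_online()/hugetlb_cgroup_node_offline() are
hypothetical helpers that would do the cgroup walks listed above, and the
notifier would be registered at init time, e.g. with
register_hotmemory_notifier().)

/* Illustrative outline only, modelled on slab_memory_callback() in mm/slub.c. */
static int hugetlb_cgroup_memory_callback(struct notifier_block *self,
					  unsigned long action, void *arg)
{
	struct memory_notify *marg = arg;
	int nid = marg->status_change_nid;

	if (nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* Allocate nodeinfo[nid] for every existing hugetlb cgroup. */
		if (hugetlb_cgroup_node_online(nid))
			return NOTIFY_BAD;
		break;
	case MEM_CANCEL_ONLINE:
	case MEM_OFFLINE:
		/* Free nodeinfo[nid] for every existing hugetlb cgroup. */
		hugetlb_cgroup_node_offline(nid);
		break;
	}
	return NOTIFY_OK;
}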

> Extra credit if you do this here and in alloc_mem_cgroup_per_node_info. :)
>
> > +     }
> >  }
> >
> >  static struct cgroup_subsys_state *
> > @@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >       struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
> >       struct hugetlb_cgroup *h_cgroup;
> >
> > -     h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> > +     unsigned int size =
> > +             sizeof(*h_cgroup) +
> > +             MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
> > +     h_cgroup = kzalloc(size, GFP_KERNEL);
> >       if (!h_cgroup)
> >               return ERR_PTR(-ENOMEM);
> >
> > @@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> >               return;
> >
> >       __set_hugetlb_cgroup(page, h_cg, rsvd);
> > -     return;
> > +     if (!rsvd && h_cg)
> > +             h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
> > +                                                              << PAGE_SHIFT;
> >  }
> >
> >  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> > @@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >
> >       if (rsvd)
> >               css_put(&h_cg->css);
> > -
> > +     else
> > +             h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
> > +                                                              << PAGE_SHIFT;
> >       return;
> >  }
> >
> > @@ -421,6 +438,56 @@ enum {
> >       RES_RSVD_FAILCNT,
> >  };
> >
> > +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> > +{
> > +     int nid;
> > +     struct cftype *cft = seq_cft(seq);
> > +     int idx = MEMFILE_IDX(cft->private);
> > +     bool legacy = MEMFILE_ATTR(cft->private);
> > +     struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> > +     struct cgroup_subsys_state *css;
> > +     unsigned long usage;
> > +
> > +     if (legacy) {
> > +             /* Add up usage across all nodes for the non-hierarchical total. */
> > +             usage = 0;
> > +             for_each_node_state(nid, N_MEMORY)
> > +                     usage += h_cg->nodeinfo[nid]->usage[idx];
> > +             seq_printf(seq, "total=%lu", usage);
> > +
> > +             /* Simply print the per-node usage for the non-hierarchical total. */
> > +             for_each_node_state(nid, N_MEMORY)
> > +                     seq_printf(seq, " N%d=%lu", nid,
> > +                                h_cg->nodeinfo[nid]->usage[idx]);
> > +             seq_putc(seq, '\n');
> > +     }
> > +
> > +     /* The hierarchical total is pretty much the value recorded by the
> > +      * counter, so use that.
>
> 'pretty much' ... so use that.  Here is one place in particular where
> input from someone with more cgroup experience would be appreciated.
>
> --
> Mike Kravetz
>
> > +      */
> > +     seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> > +                page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> > +
> > +     /* For each node, transverse the css tree to obtain the hierarichal
> > +      * node usage.
> > +      */
> > +     for_each_node_state(nid, N_MEMORY) {
> > +             usage = 0;
> > +             rcu_read_lock();
> > +             css_for_each_descendant_pre(css, &h_cg->css) {
> > +                     usage += hugetlb_cgroup_from_css(css)
> > +                                      ->nodeinfo[nid]
> > +                                      ->usage[idx];
> > +             }
> > +             rcu_read_unlock();
> > +             seq_printf(seq, " N%d=%lu", nid, usage);
> > +     }
> > +
> > +     seq_putc(seq, '\n');
> > +
> > +     return 0;
> > +}
> > +
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >                                  struct cftype *cft)
> >  {
> > @@ -654,8 +721,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >       cft->seq_show = hugetlb_cgroup_read_u64_max;
> >       cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> > -     /* Add the events file */
> > +     /* Add the numa stat file */
> >       cft = &h->cgroup_files_dfl[4];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> > +     cft->seq_show = hugetlb_cgroup_read_numa_stat;
> > +     cft->flags = CFTYPE_NOT_ON_ROOT;
> > +
> > +     /* Add the events file */
> > +     cft = &h->cgroup_files_dfl[5];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, 0);
> >       cft->seq_show = hugetlb_events_show;
> > @@ -663,7 +736,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >       cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> >       /* Add the events.local file */
> > -     cft = &h->cgroup_files_dfl[5];
> > +     cft = &h->cgroup_files_dfl[6];
> >       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
> >       cft->private = MEMFILE_PRIVATE(idx, 0);
> >       cft->seq_show = hugetlb_events_local_show;
> > @@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >       cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> >       /* NULL terminate the last cft */
> > -     cft = &h->cgroup_files_dfl[6];
> > +     cft = &h->cgroup_files_dfl[7];
> >       memset(cft, 0, sizeof(*cft));
> >
> >       WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
> > @@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
> >       cft->write = hugetlb_cgroup_reset;
> >       cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +     /* Add the numa stat file */
> > +     cft = &h->cgroup_files_dfl[8];
> > +     snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> > +     cft->private = MEMFILE_PRIVATE(idx, 1);
> > +     cft->seq_show = hugetlb_cgroup_read_numa_stat;
> > +
> >       /* NULL terminate the last cft */
> > -     cft = &h->cgroup_files_legacy[8];
> > +     cft = &h->cgroup_files_legacy[9];
> >       memset(cft, 0, sizeof(*cft));
> >
> >       WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> > diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > index 6a2caba19ee1..d2da6315a40c 100644
> > --- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > @@ -37,8 +37,8 @@ static int shmid;
> >  static void exit_usage(void)
> >  {
> >       printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> > -            "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> > -            "[-o] [-w] [-n]\n",
> > +            "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
> > +            "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
> >              self);
> >       exit(EXIT_FAILURE);
> >  }
> > @@ -161,6 +161,11 @@ int main(int argc, char **argv)
> >       else
> >               printf("RESERVE mapping.\n");
> >
> > +     if (want_sleep)
> > +             printf("Sleeping\n");
> > +     else
> > +             printf("Not sleeping\n");
> > +
> >       switch (method) {
> >       case HUGETLBFS:
> >               printf("Allocating using HUGETLBFS.\n");
> > --
> > 2.33.0.1079.g6e70778dc9-goog
> >
Mike Kravetz Nov. 1, 2021, 11:19 p.m. UTC | #3
On 10/31/21 1:39 PM, Mina Almasry wrote:
> On Wed, Oct 27, 2021 at 4:36 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 10/20/21 12:09 PM, Mina Almasry wrote:
>>
>> I have no objections to adding this functionality, and do not see any
>> blocking issues in hugetlb code.  However, it would be GREAT if someone
>> more familiar/experienced with cgroups would comment.  My cgroup
>> experience is very limited.
>>
> 
> I will send V2 to Shakeel as well from our team and ask him to take a
> look and he has more than enough experience to review. If anyone else
> reading with cgroups knowledge can Review/Ack that would be great.
> 
> It's possible I'm a bit off the mark here, but FWIW I don't think
> there is much that is continuous or ambiguous here. For
> memory.numa_stat it's a bit nuanced because there is anon, rss, shmem,
> etc.. but for hugetlb we just have hugetlb memory and the only care
> needed is hierarchical vs not in cgroups v1.
> 

I agree.  It is straightforward from a hugetlb POV.  Just would like to
get an ack from someone more familiar with cgroups.  To me it also looks
fine from a cgroups POV, but my cgroup knowledge is so limited that it
does not mean much.

>> alloc_mem_cgroup_per_node_info provides similar functionality and has
>> the following comment.
>>
>>          * TODO: this routine can waste much memory for nodes which will
>>          *       never be onlined. It's better to use memory hotplug callback
>>          *       function.
>>
> 
> So the memory allocated per node in total is (HUGE_MAX_HSTATE *
> unsigned long * num_css_on_the system). HUGE_MAX_HSTATE is 2 on x86.
> This is significant, but to address this comment I have to complicate
> the hugetlb_cgroup code quite a bit so I thought I'd check with you if
> you think it's still OK/worth it. slub.c implements these changes
> (slab_memory_callback) and they are:
> 
> - When creating a new hugetlb_cgroup, I create per_node data for only
> online nodes.
> - On node online I need to loop over all existing hugetlb_cgroups and
> allocate per_node data. I need rcu_read_lock() here and below.
> - On node offline I need to loop over all existing hugetlb_cgroups and
> free per_node data.
> - If I follow the slub example, I need a lock to synchronize
> onlining/offlining with references to per_node data in commit_charge()
> uncharge_page() and show_numa_stats().
> 
> I don't think it's worth it TBH, but I'm happy to oblige if you think so.
> 

No need to do anything extra/special here.  I just noted that you would
be allocating memory for offline nodes and took a look to see if the
memory cgroup code handled this case.  They do not, and have this as a
TODO.  I think that is sufficient here.

Thanks,
Muchun Song Nov. 2, 2021, 5:11 a.m. UTC | #4
On Thu, Oct 21, 2021 at 3:09 AM Mina Almasry <almasrymina@google.com> wrote:
>
> For hugetlb backed jobs/VMs it's critical to understand the numa
> information for the memory backing these jobs to deliver optimal
> performance.
>
> Currently this techinically can be queried from /proc/self/numa_maps, but
> there are significant issues with that. Namely:
> 1. Memory can be mapped on unmapped.
> 2. numa_maps are per process and need to be aggregaged across all
>    proceses in the cgroup. For shared memory this is more involved as
>    the userspace needs to make sure it doesn't double count shared
>    mappings.
> 3. I believe querying numa_maps needs to hold the mmap_lock which adds
>    to the contention on this lock.
>
> For these reasons I propose simply adding hugetlb.*.numa_stat file,
> which shows the numa information of the cgroup similarly to
> memory.numa_stat.
>
> On cgroup-v2:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>
> On cgroup-v1:
>    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>    hierarichal_total=2097152 N0=2097152 N1=0
>
> This patch was tested manually by allocating hugetlb memory and querying
> the hugetlb.*.numa_stat file of the cgroup and its parents.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jue Wang <juew@google.com>
> Cc: Yang Yao <ygyao@google.com>
> Cc: Joanna Li <joannali@google.com>
> Cc: Cannon Matthews <cannonmatthews@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> Changes in v2:
> - Fix warning Reported-by: kernel test robot <lkp@intel.com>
> ---
>  .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
>  Documentation/admin-guide/cgroup-v2.rst       |  7 ++
>  include/linux/hugetlb.h                       |  4 +-
>  include/linux/hugetlb_cgroup.h                |  7 ++
>  mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
>  .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
>  6 files changed, 113 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> index 338f2c7d7a1c..0fa724d82abb 100644
> --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> @@ -29,12 +29,14 @@ Brief summary of control files::
>   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
>   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
>
>  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
>  files include::
>
>    hugetlb.1GB.limit_in_bytes
>    hugetlb.1GB.max_usage_in_bytes
> +  hugetlb.1GB.numa_stat
>    hugetlb.1GB.usage_in_bytes
>    hugetlb.1GB.failcnt
>    hugetlb.1GB.rsvd.limit_in_bytes
> @@ -43,6 +45,7 @@ files include::
>    hugetlb.1GB.rsvd.failcnt
>    hugetlb.64KB.limit_in_bytes
>    hugetlb.64KB.max_usage_in_bytes
> +  hugetlb.64KB.numa_stat
>    hugetlb.64KB.usage_in_bytes
>    hugetlb.64KB.failcnt
>    hugetlb.64KB.rsvd.limit_in_bytes
> @@ -51,6 +54,7 @@ files include::
>    hugetlb.64KB.rsvd.failcnt
>    hugetlb.32MB.limit_in_bytes
>    hugetlb.32MB.max_usage_in_bytes
> +  hugetlb.32MB.numa_stat
>    hugetlb.32MB.usage_in_bytes
>    hugetlb.32MB.failcnt
>    hugetlb.32MB.rsvd.limit_in_bytes
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 4d8c27eca96b..8ba0d6aadd2c 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2252,6 +2252,13 @@ HugeTLB Interface Files
>         are local to the cgroup i.e. not hierarchical. The file modified event
>         generated on this file reflects only the local events.
>
> +  hugetlb.<hugepagesize>.numa_stat
> +       Similar to memory.numa_stat, it shows the numa information of the
> +       memory in this cgroup:
> +
> +       /dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
> +       total=0 N0=0 N1=0
> +
>  Misc
>  ----
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..0445faaa636e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -613,8 +613,8 @@ struct hstate {
>  #endif
>  #ifdef CONFIG_CGROUP_HUGETLB
>         /* cgroup control files */
> -       struct cftype cgroup_files_dfl[7];
> -       struct cftype cgroup_files_legacy[9];
> +       struct cftype cgroup_files_dfl[8];
> +       struct cftype cgroup_files_legacy[10];
>  #endif
>         char name[HSTATE_NAME_LEN];
>  };
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index c137396129db..54ff6ec68ed3 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
>         HUGETLB_NR_MEMORY_EVENTS,
>  };
>
> +struct hugetlb_cgroup_per_node {
> +       /* hugetlb usage in bytes over all hstates. */
> +       unsigned long usage[HUGE_MAX_HSTATE];
> +};
> +
>  struct hugetlb_cgroup {
>         struct cgroup_subsys_state css;
>
> @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
>
>         /* Handle for "hugetlb.events.local" */
>         struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> +
> +       struct hugetlb_cgroup_per_node *nodeinfo[];
>  };
>
>  static inline struct hugetlb_cgroup *
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 5383023d0cca..4b807292f7e8 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>                                 struct hugetlb_cgroup *parent_h_cgroup)
>  {
>         int idx;
> +       int node;
>
>         for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
>                 struct page_counter *fault_parent = NULL;
> @@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>                         limit);
>                 VM_BUG_ON(ret);
>         }
> +
> +       for_each_node(node) {
> +               /* Set node_to_alloc to -1 for offline nodes. */
> +               int node_to_alloc =
> +                       node_state(node, N_NORMAL_MEMORY) ? node : -1;
> +               h_cgroup->nodeinfo[node] =
> +                       kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> +                                    GFP_KERNEL, node_to_alloc);
> +       }
>  }
>
>  static struct cgroup_subsys_state *
> @@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>         struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
>         struct hugetlb_cgroup *h_cgroup;
>
> -       h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> +       unsigned int size =
> +               sizeof(*h_cgroup) +
> +               MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
> +       h_cgroup = kzalloc(size, GFP_KERNEL);

I recommend using struct_size() to calculate the allocated size, like the
following.

  h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, MAX_NUMNODES), GFP_KERNEL);

I also have a question, can't we replace MAX_NUMNODES with
nr_node_ids here? In that case, we can save some memory.
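
Put together, that would make the allocation look roughly like this
(a sketch of the suggestion, not the posted patch; struct_size() comes from
<linux/overflow.h> and nodeinfo[] is the flexible array member at the end of
struct hugetlb_cgroup):

	/* Size the struct plus one nodeinfo pointer per possible node id. */
	h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, nr_node_ids),
			   GFP_KERNEL);
	if (!h_cgroup)
		return ERR_PTR(-ENOMEM);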

>         if (!h_cgroup)
>                 return ERR_PTR(-ENOMEM);
>
> @@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>                 return;
>
>         __set_hugetlb_cgroup(page, h_cg, rsvd);
> -       return;
> +       if (!rsvd && h_cg)
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
> +                                                                << PAGE_SHIFT;
>  }
>
>  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> @@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>
>         if (rsvd)
>                 css_put(&h_cg->css);
> -
> +       else
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
> +                                                                << PAGE_SHIFT;
>         return;
>  }
>
> @@ -421,6 +438,56 @@ enum {
>         RES_RSVD_FAILCNT,
>  };
>
> +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> +{
> +       int nid;
> +       struct cftype *cft = seq_cft(seq);
> +       int idx = MEMFILE_IDX(cft->private);
> +       bool legacy = MEMFILE_ATTR(cft->private);
> +       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> +       struct cgroup_subsys_state *css;
> +       unsigned long usage;
> +
> +       if (legacy) {
> +               /* Add up usage across all nodes for the non-hierarchical total. */
> +               usage = 0;
> +               for_each_node_state(nid, N_MEMORY)
> +                       usage += h_cg->nodeinfo[nid]->usage[idx];
> +               seq_printf(seq, "total=%lu", usage);
> +
> +               /* Simply print the per-node usage for the non-hierarchical total. */
> +               for_each_node_state(nid, N_MEMORY)
> +                       seq_printf(seq, " N%d=%lu", nid,
> +                                  h_cg->nodeinfo[nid]->usage[idx]);
> +               seq_putc(seq, '\n');
> +       }
> +
> +       /* The hierarchical total is pretty much the value recorded by the
> +        * counter, so use that.
> +        */

Please use the following pattern for multi-line comments.

/*
 * Comments.
 */

> +       seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> +                  page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> +
> +       /* For each node, transverse the css tree to obtain the hierarichal
> +        * node usage.
> +        */

Same here.

> +       for_each_node_state(nid, N_MEMORY) {
> +               usage = 0;
> +               rcu_read_lock();
> +               css_for_each_descendant_pre(css, &h_cg->css) {
> +                       usage += hugetlb_cgroup_from_css(css)
> +                                        ->nodeinfo[nid]
> +                                        ->usage[idx];
> +               }
> +               rcu_read_unlock();
> +               seq_printf(seq, " N%d=%lu", nid, usage);
> +       }
> +
> +       seq_putc(seq, '\n');
> +
> +       return 0;
> +}
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>                                    struct cftype *cft)
>  {
> @@ -654,8 +721,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>         cft->seq_show = hugetlb_cgroup_read_u64_max;
>         cft->flags = CFTYPE_NOT_ON_ROOT;
>
> -       /* Add the events file */
> +       /* Add the numa stat file */
>         cft = &h->cgroup_files_dfl[4];

Why not use &h->cgroup_files_dfl[6] as the numa stat file?
It could be kept as simple as what you did in
__hugetlb_cgroup_file_legacy_init().

Thanks.
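
(In other words, append the new file after the existing entries instead of
renumbering them; an illustrative sketch of the suggestion, not the posted
patch:)

	/* events stays at cgroup_files_dfl[4], events.local at [5], then: */

	/* Add the numa stat file */
	cft = &h->cgroup_files_dfl[6];
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
	cft->private = MEMFILE_PRIVATE(idx, 0);
	cft->seq_show = hugetlb_cgroup_read_numa_stat;
	cft->flags = CFTYPE_NOT_ON_ROOT;

	/* NULL terminate the last cft */
	cft = &h->cgroup_files_dfl[7];
	memset(cft, 0, sizeof(*cft));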

> +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +       cft->flags = CFTYPE_NOT_ON_ROOT;
> +
> +       /* Add the events file */
> +       cft = &h->cgroup_files_dfl[5];
>         snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
>         cft->private = MEMFILE_PRIVATE(idx, 0);
>         cft->seq_show = hugetlb_events_show;
> @@ -663,7 +736,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>         cft->flags = CFTYPE_NOT_ON_ROOT;
>
>         /* Add the events.local file */
> -       cft = &h->cgroup_files_dfl[5];
> +       cft = &h->cgroup_files_dfl[6];
>         snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
>         cft->private = MEMFILE_PRIVATE(idx, 0);
>         cft->seq_show = hugetlb_events_local_show;
> @@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>         cft->flags = CFTYPE_NOT_ON_ROOT;
>
>         /* NULL terminate the last cft */
> -       cft = &h->cgroup_files_dfl[6];
> +       cft = &h->cgroup_files_dfl[7];
>         memset(cft, 0, sizeof(*cft));
>
>         WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
> @@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
>         cft->write = hugetlb_cgroup_reset;
>         cft->read_u64 = hugetlb_cgroup_read_u64;
>
> +       /* Add the numa stat file */
> +       cft = &h->cgroup_files_dfl[8];
> +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +       cft->private = MEMFILE_PRIVATE(idx, 1);
> +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +
>         /* NULL terminate the last cft */
> -       cft = &h->cgroup_files_legacy[8];
> +       cft = &h->cgroup_files_legacy[9];
>         memset(cft, 0, sizeof(*cft));
>
>         WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> index 6a2caba19ee1..d2da6315a40c 100644
> --- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
> +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> @@ -37,8 +37,8 @@ static int shmid;
>  static void exit_usage(void)
>  {
>         printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> -              "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> -              "[-o] [-w] [-n]\n",
> +              "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
> +              "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
>                self);
>         exit(EXIT_FAILURE);
>  }
> @@ -161,6 +161,11 @@ int main(int argc, char **argv)
>         else
>                 printf("RESERVE mapping.\n");
>
> +       if (want_sleep)
> +               printf("Sleeping\n");
> +       else
> +               printf("Not sleeping\n");
> +
>         switch (method) {
>         case HUGETLBFS:
>                 printf("Allocating using HUGETLBFS.\n");
> --
> 2.33.0.1079.g6e70778dc9-goog
Mina Almasry Nov. 4, 2021, 1:43 a.m. UTC | #5
On Mon, Nov 1, 2021 at 10:12 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Thu, Oct 21, 2021 at 3:09 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > For hugetlb backed jobs/VMs it's critical to understand the numa
> > information for the memory backing these jobs to deliver optimal
> > performance.
> >
> > Currently this techinically can be queried from /proc/self/numa_maps, but
> > there are significant issues with that. Namely:
> > 1. Memory can be mapped on unmapped.
> > 2. numa_maps are per process and need to be aggregaged across all
> >    proceses in the cgroup. For shared memory this is more involved as
> >    the userspace needs to make sure it doesn't double count shared
> >    mappings.
> > 3. I believe querying numa_maps needs to hold the mmap_lock which adds
> >    to the contention on this lock.
> >
> > For these reasons I propose simply adding hugetlb.*.numa_stat file,
> > which shows the numa information of the cgroup similarly to
> > memory.numa_stat.
> >
> > On cgroup-v2:
> >    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
> >    total=2097152 N0=2097152 N1=0
> >
> > On cgroup-v1:
> >    cat /dev/cgroup/memory/test/hugetlb.2MB.numa_stat
> >    total=2097152 N0=2097152 N1=0
> >    hierarichal_total=2097152 N0=2097152 N1=0
> >
> > This patch was tested manually by allocating hugetlb memory and querying
> > the hugetlb.*.numa_stat file of the cgroup and its parents.
> > 
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: Miaohe Lin <linmiaohe@huawei.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Muchun Song <songmuchun@bytedance.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Jue Wang <juew@google.com>
> > Cc: Yang Yao <ygyao@google.com>
> > Cc: Joanna Li <joannali@google.com>
> > Cc: Cannon Matthews <cannonmatthews@google.com>
> > Cc: linux-mm@kvack.org
> > Cc: linux-kernel@vger.kernel.org
> >
> > Signed-off-by: Mina Almasry <almasrymina@google.com>
> >

Thank you very much Mike and Muchun for the review. I've uploaded v3
with the fixes!

> > ---
> >
> > Changes in v2:
> > - Fix warning Reported-by: kernel test robot <lkp@intel.com>
> > ---
> >  .../admin-guide/cgroup-v1/hugetlb.rst         |  4 +
> >  Documentation/admin-guide/cgroup-v2.rst       |  7 ++
> >  include/linux/hugetlb.h                       |  4 +-
> >  include/linux/hugetlb_cgroup.h                |  7 ++
> >  mm/hugetlb_cgroup.c                           | 93 +++++++++++++++++--
> >  .../testing/selftests/vm/write_to_hugetlbfs.c |  9 +-
> >  6 files changed, 113 insertions(+), 11 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > index 338f2c7d7a1c..0fa724d82abb 100644
> > --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> > @@ -29,12 +29,14 @@ Brief summary of control files::
> >   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
> >   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
> >   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> > + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
> >
> >  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
> >  files include::
> >
> >    hugetlb.1GB.limit_in_bytes
> >    hugetlb.1GB.max_usage_in_bytes
> > +  hugetlb.1GB.numa_stat
> >    hugetlb.1GB.usage_in_bytes
> >    hugetlb.1GB.failcnt
> >    hugetlb.1GB.rsvd.limit_in_bytes
> > @@ -43,6 +45,7 @@ files include::
> >    hugetlb.1GB.rsvd.failcnt
> >    hugetlb.64KB.limit_in_bytes
> >    hugetlb.64KB.max_usage_in_bytes
> > +  hugetlb.64KB.numa_stat
> >    hugetlb.64KB.usage_in_bytes
> >    hugetlb.64KB.failcnt
> >    hugetlb.64KB.rsvd.limit_in_bytes
> > @@ -51,6 +54,7 @@ files include::
> >    hugetlb.64KB.rsvd.failcnt
> >    hugetlb.32MB.limit_in_bytes
> >    hugetlb.32MB.max_usage_in_bytes
> > +  hugetlb.32MB.numa_stat
> >    hugetlb.32MB.usage_in_bytes
> >    hugetlb.32MB.failcnt
> >    hugetlb.32MB.rsvd.limit_in_bytes
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 4d8c27eca96b..8ba0d6aadd2c 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -2252,6 +2252,13 @@ HugeTLB Interface Files
> >         are local to the cgroup i.e. not hierarchical. The file modified event
> >         generated on this file reflects only the local events.
> >
> > +  hugetlb.<hugepagesize>.numa_stat
> > +       Similar to memory.numa_stat, it shows the numa information of the
> > +       memory in this cgroup:
> > +
> > +       /dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
> > +       total=0 N0=0 N1=0
> > +
> >  Misc
> >  ----
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 1faebe1cd0ed..0445faaa636e 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -613,8 +613,8 @@ struct hstate {
> >  #endif
> >  #ifdef CONFIG_CGROUP_HUGETLB
> >         /* cgroup control files */
> > -       struct cftype cgroup_files_dfl[7];
> > -       struct cftype cgroup_files_legacy[9];
> > +       struct cftype cgroup_files_dfl[8];
> > +       struct cftype cgroup_files_legacy[10];
> >  #endif
> >         char name[HSTATE_NAME_LEN];
> >  };
> > diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> > index c137396129db..54ff6ec68ed3 100644
> > --- a/include/linux/hugetlb_cgroup.h
> > +++ b/include/linux/hugetlb_cgroup.h
> > @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
> >         HUGETLB_NR_MEMORY_EVENTS,
> >  };
> >
> > +struct hugetlb_cgroup_per_node {
> > +       /* hugetlb usage in bytes over all hstates. */
> > +       unsigned long usage[HUGE_MAX_HSTATE];
> > +};
> > +
> >  struct hugetlb_cgroup {
> >         struct cgroup_subsys_state css;
> >
> > @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
> >
> >         /* Handle for "hugetlb.events.local" */
> >         struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> > +
> > +       struct hugetlb_cgroup_per_node *nodeinfo[];
> >  };
> >
> >  static inline struct hugetlb_cgroup *
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 5383023d0cca..4b807292f7e8 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -92,6 +92,7 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> >                                 struct hugetlb_cgroup *parent_h_cgroup)
> >  {
> >         int idx;
> > +       int node;
> >
> >         for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
> >                 struct page_counter *fault_parent = NULL;
> > @@ -124,6 +125,15 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
> >                         limit);
> >                 VM_BUG_ON(ret);
> >         }
> > +
> > +       for_each_node(node) {
> > +               /* Set node_to_alloc to -1 for offline nodes. */
> > +               int node_to_alloc =
> > +                       node_state(node, N_NORMAL_MEMORY) ? node : -1;
> > +               h_cgroup->nodeinfo[node] =
> > +                       kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> > +                                    GFP_KERNEL, node_to_alloc);
> > +       }
> >  }
> >
> >  static struct cgroup_subsys_state *
> > @@ -132,7 +142,10 @@ hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >         struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
> >         struct hugetlb_cgroup *h_cgroup;
> >
> > -       h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
> > +       unsigned int size =
> > +               sizeof(*h_cgroup) +
> > +               MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
> > +       h_cgroup = kzalloc(size, GFP_KERNEL);
>
> I recommend using struct_size() to calculate the allocated size like
> following.
>
>   h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, MAX_NUMNODES), GFP_KERNEL);
>
> I also have a question, can't we replace MAX_NUMNODES with
> nr_node_ids here? In that case, we can save some memory.
>
> >         if (!h_cgroup)
> >                 return ERR_PTR(-ENOMEM);
> >
> > @@ -292,7 +305,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> >                 return;
> >
> >         __set_hugetlb_cgroup(page, h_cg, rsvd);
> > -       return;
> > +       if (!rsvd && h_cg)
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
> > +                                                                << PAGE_SHIFT;
> >  }
> >
> >  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> > @@ -331,7 +346,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
> >
> >         if (rsvd)
> >                 css_put(&h_cg->css);
> > -
> > +       else
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
> > +                                                                << PAGE_SHIFT;
> >         return;
> >  }
> >
> > @@ -421,6 +438,56 @@ enum {
> >         RES_RSVD_FAILCNT,
> >  };
> >
> > +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> > +{
> > +       int nid;
> > +       struct cftype *cft = seq_cft(seq);
> > +       int idx = MEMFILE_IDX(cft->private);
> > +       bool legacy = MEMFILE_ATTR(cft->private);
> > +       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> > +       struct cgroup_subsys_state *css;
> > +       unsigned long usage;
> > +
> > +       if (legacy) {
> > +               /* Add up usage across all nodes for the non-hierarchical total. */
> > +               usage = 0;
> > +               for_each_node_state(nid, N_MEMORY)
> > +                       usage += h_cg->nodeinfo[nid]->usage[idx];
> > +               seq_printf(seq, "total=%lu", usage);
> > +
> > +               /* Simply print the per-node usage for the non-hierarchical total. */
> > +               for_each_node_state(nid, N_MEMORY)
> > +                       seq_printf(seq, " N%d=%lu", nid,
> > +                                  h_cg->nodeinfo[nid]->usage[idx]);
> > +               seq_putc(seq, '\n');
> > +       }
> > +
> > +       /* The hierarchical total is pretty much the value recorded by the
> > +        * counter, so use that.
> > +        */
>
> Please use the following pattern for multi-line comments.
>
> /*
>  * Comments.
>  */
>
> > +       seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> > +                  page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> > +
> > +       /* For each node, traverse the css tree to obtain the hierarchical
> > +        * node usage.
> > +        */
>
> Same here.
>
> > +       for_each_node_state(nid, N_MEMORY) {
> > +               usage = 0;
> > +               rcu_read_lock();
> > +               css_for_each_descendant_pre(css, &h_cg->css) {
> > +                       usage += hugetlb_cgroup_from_css(css)
> > +                                        ->nodeinfo[nid]
> > +                                        ->usage[idx];
> > +               }
> > +               rcu_read_unlock();
> > +               seq_printf(seq, " N%d=%lu", nid, usage);
> > +       }
> > +
> > +       seq_putc(seq, '\n');
> > +
> > +       return 0;
> > +}
> > +
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >                                    struct cftype *cft)
> >  {
> > @@ -654,8 +721,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >         cft->seq_show = hugetlb_cgroup_read_u64_max;
> >         cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> > -       /* Add the events file */
> > +       /* Add the numa stat file */
> >         cft = &h->cgroup_files_dfl[4];
>
> Why not use &h->cgroup_files_dfl[6] for the numa stat file?
> That would keep it simple, just like you did in
> __hugetlb_cgroup_file_legacy_init().
>
> Thanks.
>
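For reference, a sketch of that arrangement (entry indices assumed; the existing events and events.local entries would then keep their current slots, with the NULL terminator at [7]). It also sets cft->private, since hugetlb_cgroup_read_numa_stat() derives the hstate index from it:

	/* Add the numa stat file (appended after events.local). */
	cft = &h->cgroup_files_dfl[6];
	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
	cft->private = MEMFILE_PRIVATE(idx, 0);
	cft->seq_show = hugetlb_cgroup_read_numa_stat;
	cft->flags = CFTYPE_NOT_ON_ROOT;

	/* NULL terminate the last cft */
	cft = &h->cgroup_files_dfl[7];
	memset(cft, 0, sizeof(*cft));
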
> > +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> > +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> > +       cft->flags = CFTYPE_NOT_ON_ROOT;
> > +
> > +       /* Add the events file */
> > +       cft = &h->cgroup_files_dfl[5];
> >         snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
> >         cft->private = MEMFILE_PRIVATE(idx, 0);
> >         cft->seq_show = hugetlb_events_show;
> > @@ -663,7 +736,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >         cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> >         /* Add the events.local file */
> > -       cft = &h->cgroup_files_dfl[5];
> > +       cft = &h->cgroup_files_dfl[6];
> >         snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
> >         cft->private = MEMFILE_PRIVATE(idx, 0);
> >         cft->seq_show = hugetlb_events_local_show;
> > @@ -672,7 +745,7 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
> >         cft->flags = CFTYPE_NOT_ON_ROOT;
> >
> >         /* NULL terminate the last cft */
> > -       cft = &h->cgroup_files_dfl[6];
> > +       cft = &h->cgroup_files_dfl[7];
> >         memset(cft, 0, sizeof(*cft));
> >
> >         WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
> > @@ -742,8 +815,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
> >         cft->write = hugetlb_cgroup_reset;
> >         cft->read_u64 = hugetlb_cgroup_read_u64;
> >
> > +       /* Add the numa stat file */
> > +       cft = &h->cgroup_files_legacy[8];
> > +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> > +       cft->private = MEMFILE_PRIVATE(idx, 1);
> > +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> > +
> >         /* NULL terminate the last cft */
> > -       cft = &h->cgroup_files_legacy[8];
> > +       cft = &h->cgroup_files_legacy[9];
> >         memset(cft, 0, sizeof(*cft));
> >
> >         WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> > diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > index 6a2caba19ee1..d2da6315a40c 100644
> > --- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> > @@ -37,8 +37,8 @@ static int shmid;
> >  static void exit_usage(void)
> >  {
> >         printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
> > -              "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
> > -              "[-o] [-w] [-n]\n",
> > +              "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
> > +              "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
> >                self);
> >         exit(EXIT_FAILURE);
> >  }
> > @@ -161,6 +161,11 @@ int main(int argc, char **argv)
> >         else
> >                 printf("RESERVE mapping.\n");
> >
> > +       if (want_sleep)
> > +               printf("Sleeping\n");
> > +       else
> > +               printf("Not sleeping\n");
> > +
> >         switch (method) {
> >         case HUGETLBFS:
> >                 printf("Allocating using HUGETLBFS.\n");
> > --
> > 2.33.0.1079.g6e70778dc9-goog
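
For anyone experimenting with the new file, a small userspace sketch (illustrative only, not part of this patch; the cgroup path below is an assumption about the test setup) that reads a hugetlb.<size>.numa_stat file and prints each field:

/*
 * Illustrative only: dump the fields of a hugetlb.<size>.numa_stat file.
 * The default path is an assumption about where cgroup2 is mounted.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] :
		"/sys/fs/cgroup/test/hugetlb.2MB.numa_stat";
	char line[4096];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Each line looks like: "total=<bytes> N0=<bytes> N1=<bytes> ...". */
	while (fgets(line, sizeof(line), f)) {
		char *save = NULL;

		for (char *tok = strtok_r(line, " \n", &save); tok;
		     tok = strtok_r(NULL, " \n", &save)) {
			char *eq = strchr(tok, '=');

			if (!eq)
				continue;
			*eq = '\0';
			printf("%-12s %lu bytes\n", tok,
			       strtoul(eq + 1, NULL, 10));
		}
	}

	fclose(f);
	return 0;
}

It builds with any C compiler and takes the file path as an optional argument.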

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@  Brief summary of control files::
  hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
  hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::

   hugetlb.1GB.limit_in_bytes
   hugetlb.1GB.max_usage_in_bytes
+  hugetlb.1GB.numa_stat
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
   hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@  files include::
   hugetlb.1GB.rsvd.failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
+  hugetlb.64KB.numa_stat
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
   hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@  files include::
   hugetlb.64KB.rsvd.failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
+  hugetlb.32MB.numa_stat
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
   hugetlb.32MB.rsvd.limit_in_bytes
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..8ba0d6aadd2c 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2252,6 +2252,13 @@  HugeTLB Interface Files
 	are local to the cgroup i.e. not hierarchical. The file modified event
 	generated on this file reflects only the local events.

+  hugetlb.<hugepagesize>.numa_stat
+	Similar to memory.numa_stat, it shows the numa information of the
+	memory in this cgroup:
+
+	/dev/cgroup/memory/test # cat hugetlb.2MB.numa_stat
+	total=0 N0=0 N1=0
+
 Misc
 ----

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..0445faaa636e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -613,8 +613,8 @@  struct hstate {
 #endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[7];
-	struct cftype cgroup_files_legacy[9];
+	struct cftype cgroup_files_dfl[8];
+	struct cftype cgroup_files_legacy[10];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c137396129db..54ff6ec68ed3 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -36,6 +36,11 @@  enum hugetlb_memory_event {
 	HUGETLB_NR_MEMORY_EVENTS,
 };

+struct hugetlb_cgroup_per_node {
+	/* hugetlb usage in bytes over all hstates. */
+	unsigned long usage[HUGE_MAX_HSTATE];
+};
+
 struct hugetlb_cgroup {
 	struct cgroup_subsys_state css;

@@ -57,6 +62,8 @@  struct hugetlb_cgroup {

 	/* Handle for "hugetlb.events.local" */
 	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
+
+	struct hugetlb_cgroup_per_node *nodeinfo[];
 };

 static inline struct hugetlb_cgroup *
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 5383023d0cca..4b807292f7e8 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -92,6 +92,7 @@  static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 				struct hugetlb_cgroup *parent_h_cgroup)
 {
 	int idx;
+	int node;

 	for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
 		struct page_counter *fault_parent = NULL;
@@ -124,6 +125,15 @@  static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 			limit);
 		VM_BUG_ON(ret);
 	}
+
+	for_each_node(node) {
+		/* Set node_to_alloc to -1 for offline nodes. */
+		int node_to_alloc =
+			node_state(node, N_NORMAL_MEMORY) ? node : -1;
+		h_cgroup->nodeinfo[node] =
+			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
+				     GFP_KERNEL, node_to_alloc);
+	}
 }

 static struct cgroup_subsys_state *
@@ -132,7 +142,10 @@  hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
 	struct hugetlb_cgroup *h_cgroup;

-	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
+	unsigned int size =
+		sizeof(*h_cgroup) +
+		MAX_NUMNODES * sizeof(struct hugetlb_cgroup_per_node *);
+	h_cgroup = kzalloc(size, GFP_KERNEL);
 	if (!h_cgroup)
 		return ERR_PTR(-ENOMEM);

@@ -292,7 +305,9 @@  static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 		return;

 	__set_hugetlb_cgroup(page, h_cg, rsvd);
-	return;
+	if (!rsvd && h_cg)
+		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages
+								 << PAGE_SHIFT;
 }

 void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
@@ -331,7 +346,9 @@  static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,

 	if (rsvd)
 		css_put(&h_cg->css);
-
+	else
+		h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages
+								 << PAGE_SHIFT;
 	return;
 }

@@ -421,6 +438,56 @@  enum {
 	RES_RSVD_FAILCNT,
 };

+static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
+{
+	int nid;
+	struct cftype *cft = seq_cft(seq);
+	int idx = MEMFILE_IDX(cft->private);
+	bool legacy = MEMFILE_ATTR(cft->private);
+	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
+	struct cgroup_subsys_state *css;
+	unsigned long usage;
+
+	if (legacy) {
+		/* Add up usage across all nodes for the non-hierarchical total. */
+		usage = 0;
+		for_each_node_state(nid, N_MEMORY)
+			usage += h_cg->nodeinfo[nid]->usage[idx];
+		seq_printf(seq, "total=%lu", usage);
+
+		/* Simply print the per-node usage for the non-hierarchical total. */
+		for_each_node_state(nid, N_MEMORY)
+			seq_printf(seq, " N%d=%lu", nid,
+				   h_cg->nodeinfo[nid]->usage[idx]);
+		seq_putc(seq, '\n');
+	}
+
+	/* The hierarchical total is pretty much the value recorded by the
+	 * counter, so use that.
+	 */
+	seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
+		   page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
+
+	/* For each node, traverse the css tree to obtain the hierarchical
+	 * node usage.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		usage = 0;
+		rcu_read_lock();
+		css_for_each_descendant_pre(css, &h_cg->css) {
+			usage += hugetlb_cgroup_from_css(css)
+					 ->nodeinfo[nid]
+					 ->usage[idx];
+		}
+		rcu_read_unlock();
+		seq_printf(seq, " N%d=%lu", nid, usage);
+	}
+
+	seq_putc(seq, '\n');
+
+	return 0;
+}
+
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
@@ -654,8 +721,14 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->seq_show = hugetlb_cgroup_read_u64_max;
 	cft->flags = CFTYPE_NOT_ON_ROOT;

-	/* Add the events file */
+	/* Add the numa stat file */
 	cft = &h->cgroup_files_dfl[4];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* Add the events file */
+	cft = &h->cgroup_files_dfl[5];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_show;
@@ -663,7 +736,7 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* Add the events.local file */
-	cft = &h->cgroup_files_dfl[5];
+	cft = &h->cgroup_files_dfl[6];
 	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.events.local", buf);
 	cft->private = MEMFILE_PRIVATE(idx, 0);
 	cft->seq_show = hugetlb_events_local_show;
@@ -672,7 +745,7 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 	cft->flags = CFTYPE_NOT_ON_ROOT;

 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_dfl[6];
+	cft = &h->cgroup_files_dfl[7];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -742,8 +815,14 @@  static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the numa stat file */
+	cft = &h->cgroup_files_legacy[8];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->private = MEMFILE_PRIVATE(idx, 1);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[8];
+	cft = &h->cgroup_files_legacy[9];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c b/tools/testing/selftests/vm/write_to_hugetlbfs.c
index 6a2caba19ee1..d2da6315a40c 100644
--- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
+++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
@@ -37,8 +37,8 @@  static int shmid;
 static void exit_usage(void)
 {
 	printf("Usage: %s -p <path to hugetlbfs file> -s <size to map> "
-	       "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l] [-r] "
-	       "[-o] [-w] [-n]\n",
+	       "[-m <0=hugetlbfs | 1=mmap(MAP_HUGETLB)>] [-l(sleep)] [-r(private)] "
+	       "[-o(populate)] [-w(rite)] [-n(o-reserve)]\n",
 	       self);
 	exit(EXIT_FAILURE);
 }
@@ -161,6 +161,11 @@  int main(int argc, char **argv)
 	else
 		printf("RESERVE mapping.\n");

+	if (want_sleep)
+		printf("Sleeping\n");
+	else
+		printf("Not sleeping\n");
+
 	switch (method) {
 	case HUGETLBFS:
 		printf("Allocating using HUGETLBFS.\n");