
[v6] hugetlb: Add hugetlb.*.numa_stat file

Message ID 20211111015037.4092956-1-almasrymina@google.com (mailing list archive)
State New
Series [v6] hugetlb: Add hugetlb.*.numa_stat file

Commit Message

Mina Almasry Nov. 11, 2021, 1:50 a.m. UTC
For hugetlb-backed jobs/VMs it's critical to understand the NUMA
placement of the memory backing these jobs in order to deliver optimal
performance.

Currently this technically can be queried from /proc/self/numa_maps, but
there are significant issues with that. Namely:
1. Memory can be mapped or unmapped at any time, and numa_maps only
   reflects memory that is currently mapped.
2. numa_maps are per process and need to be aggregated across all
   processes in the cgroup. For shared memory this is more involved as
   the userspace needs to make sure it doesn't double count shared
   mappings.
3. I believe querying numa_maps needs to hold the mmap_lock which adds
   to the contention on this lock.

For these reasons I propose simply adding a hugetlb.*.numa_stat file,
which shows the NUMA information of the cgroup's hugetlb memory,
similarly to memory.numa_stat.

On cgroup-v2:
   cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0

On cgroup-v1:
   cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
   total=2097152 N0=2097152 N1=0
   hierarichal_total=2097152 N0=2097152 N1=0

This patch was tested manually by allocating hugetlb memory and querying
the hugetlb.*.numa_stat file of the cgroup and its parents.
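
For reference, a minimal userspace sketch of that manual test could look
like the program below. The 2MB default hugepage size, the cgroup-v2 mount
at /sys/fs/cgroup and the "test" cgroup are assumptions; the program has to
run inside that cgroup with enough hugepages reserved via nr_hugepages.

/*
 * Minimal sketch of the manual test: fault in one hugetlb page so it gets
 * charged to the current cgroup, then dump its hugetlb.2MB.numa_stat file.
 * Paths and the 2MB hugepage size are illustrative assumptions.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20; /* one 2MB hugetlb page */
	char line[256];
	FILE *f;
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}
	memset(p, 0, len); /* touch the page so it is actually allocated */

	f = fopen("/sys/fs/cgroup/test/hugetlb.2MB.numa_stat", "r");
	if (!f) {
		perror("fopen numa_stat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout); /* e.g. "total=2097152 N0=2097152 N1=0" */

	fclose(f);
	munmap(p, len);
	return 0;
}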

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Jue Wang <juew@google.com>
Cc: Yang Yao <ygyao@google.com>
Cc: Joanna Li <joannali@google.com>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

Signed-off-by: Mina Almasry <almasrymina@google.com>

---

Changes in v6:
- Changed usage from unsigned long to atomic_long_t

Changes in v5:
- Fixed commit message typo.
- Fixed per node usage documentation to be in pages.
- Removed unnecessary h_cg check.

Changes in v4:
- Removed unnecessary braces.
- usage is now counted in pages instead of bytes.
- Reverted unneeded changes to write_to_hugetlbfs.c

Changes in v3:
- Fixed typos (sorry!)
- Used conventional locations for cgroups mount points in docs/commit
message.
- Updated docs.
- Handle kzalloc_node failure, and proper deallocation of per node data.
- Use struct_size() to calculate the struct size.
- Use nr_node_ids instead of MAX_NUMNODES.
- Updated comments per multi-line comment pattern.

Changes in v2:
- Fix warning Reported-by: kernel test robot <lkp@intel.com>
---
 .../admin-guide/cgroup-v1/hugetlb.rst         |   4 +
 Documentation/admin-guide/cgroup-v2.rst       |   5 +
 include/linux/hugetlb.h                       |   4 +-
 include/linux/hugetlb_cgroup.h                |   7 ++
 mm/hugetlb_cgroup.c                           | 118 ++++++++++++++++--
 5 files changed, 127 insertions(+), 11 deletions(-)

--
2.34.0.rc0.344.g81b53c2807-goog

Comments

Shakeel Butt Nov. 11, 2021, 2:04 a.m. UTC | #1
On Wed, Nov 10, 2021 at 5:50 PM Mina Almasry <almasrymina@google.com> wrote:
>
> For hugetlb backed jobs/VMs it's critical to understand the numa
> information for the memory backing these jobs to deliver optimal
> performance.
>
> Currently this technically can be queried from /proc/self/numa_maps, but
> there are significant issues with that. Namely:
> 1. Memory can be mapped or unmapped.
> 2. numa_maps are per process and need to be aggregated across all
>    processes in the cgroup. For shared memory this is more involved as
>    the userspace needs to make sure it doesn't double count shared
>    mappings.
> 3. I believe querying numa_maps needs to hold the mmap_lock which adds
>    to the contention on this lock.
>
> For these reasons I propose simply adding hugetlb.*.numa_stat file,
> which shows the numa information of the cgroup similarly to
> memory.numa_stat.
>
> On cgroup-v2:
>    cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>
> On cgroup-v1:
>    cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>    hierarichal_total=2097152 N0=2097152 N1=0
>
> This patch was tested manually by allocating hugetlb memory and querying
> the hugetlb.*.numa_stat file of the cgroup and its parents.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Jue Wang <juew@google.com>
> Cc: Yang Yao <ygyao@google.com>
> Cc: Joanna Li <joannali@google.com>
> Cc: Cannon Matthews <cannonmatthews@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>


Reviewed-by: Shakeel Butt <shakeelb@google.com>
Muchun Song Nov. 11, 2021, 2:36 a.m. UTC | #2
On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
>
> For hugetlb backed jobs/VMs it's critical to understand the numa
> information for the memory backing these jobs to deliver optimal
> performance.
>
> Currently this technically can be queried from /proc/self/numa_maps, but
> there are significant issues with that. Namely:
> 1. Memory can be mapped or unmapped.
> 2. numa_maps are per process and need to be aggregated across all
>    processes in the cgroup. For shared memory this is more involved as
>    the userspace needs to make sure it doesn't double count shared
>    mappings.
> 3. I believe querying numa_maps needs to hold the mmap_lock which adds
>    to the contention on this lock.
>
> For these reasons I propose simply adding hugetlb.*.numa_stat file,
> which shows the numa information of the cgroup similarly to
> memory.numa_stat.
>
> On cgroup-v2:
>    cat /sys/fs/cgroup/unified/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>
> On cgroup-v1:
>    cat /sys/fs/cgroup/hugetlb/test/hugetlb.2MB.numa_stat
>    total=2097152 N0=2097152 N1=0
>    hierarichal_total=2097152 N0=2097152 N1=0
>
> This patch was tested manually by allocating hugetlb memory and querying
> the hugetlb.*.numa_stat file of the cgroup and its parents.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Miaohe Lin <linmiaohe@huawei.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Jue Wang <juew@google.com>
> Cc: Yang Yao <ygyao@google.com>
> Cc: Joanna Li <joannali@google.com>
> Cc: Cannon Matthews <cannonmatthews@google.com>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> Changes in v6:
> - Changed usage from unsigned long to atomic_long_t
>
> Changes in v5:
> - Fixed commit message typo.
> - Fixed per node usage documentation to be in pages.
> - Removed unnecessary h_cg check.
>
> Changes in v4:
> - Removed unnecessary braces.
> - usage is now counted in pages instead of bytes.
> - Reverted unneeded changes to write_to_hugetlbfs.c
>
> Changes in v3:
> - Fixed typos (sorry!)
> - Used conventional locations for cgroups mount points in docs/commit
> message.
> - Updated docs.
> - Handle kzalloc_node failure, and proper deallocation of per node data.
> - Use struct_size() to calculate the struct size.
> - Use nr_node_ids instead of MAX_NUMNODES.
> - Updated comments per multi-line comment pattern.
>
> Changes in v2:
> - Fix warning Reported-by: kernel test robot <lkp@intel.com>
> ---
>  .../admin-guide/cgroup-v1/hugetlb.rst         |   4 +
>  Documentation/admin-guide/cgroup-v2.rst       |   5 +
>  include/linux/hugetlb.h                       |   4 +-
>  include/linux/hugetlb_cgroup.h                |   7 ++
>  mm/hugetlb_cgroup.c                           | 118 ++++++++++++++++--
>  5 files changed, 127 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> index 338f2c7d7a1c..0fa724d82abb 100644
> --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
> @@ -29,12 +29,14 @@ Brief summary of control files::
>   hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
>   hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
>   hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
> + hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup
>
>  For a system supporting three hugepage sizes (64k, 32M and 1G), the control
>  files include::
>
>    hugetlb.1GB.limit_in_bytes
>    hugetlb.1GB.max_usage_in_bytes
> +  hugetlb.1GB.numa_stat
>    hugetlb.1GB.usage_in_bytes
>    hugetlb.1GB.failcnt
>    hugetlb.1GB.rsvd.limit_in_bytes
> @@ -43,6 +45,7 @@ files include::
>    hugetlb.1GB.rsvd.failcnt
>    hugetlb.64KB.limit_in_bytes
>    hugetlb.64KB.max_usage_in_bytes
> +  hugetlb.64KB.numa_stat
>    hugetlb.64KB.usage_in_bytes
>    hugetlb.64KB.failcnt
>    hugetlb.64KB.rsvd.limit_in_bytes
> @@ -51,6 +54,7 @@ files include::
>    hugetlb.64KB.rsvd.failcnt
>    hugetlb.32MB.limit_in_bytes
>    hugetlb.32MB.max_usage_in_bytes
> +  hugetlb.32MB.numa_stat
>    hugetlb.32MB.usage_in_bytes
>    hugetlb.32MB.failcnt
>    hugetlb.32MB.rsvd.limit_in_bytes
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 4d8c27eca96b..356847f8f008 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2252,6 +2252,11 @@ HugeTLB Interface Files
>         are local to the cgroup i.e. not hierarchical. The file modified event
>         generated on this file reflects only the local events.
>
> +  hugetlb.<hugepagesize>.numa_stat
> +       Similar to memory.numa_stat, it shows the numa information of the
> +        hugetlb pages of <hugepagesize> in this cgroup.  Only active in
> +        use hugetlb pages are included.  The per-node values are in bytes.
> +
>  Misc
>  ----
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 1faebe1cd0ed..0445faaa636e 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -613,8 +613,8 @@ struct hstate {
>  #endif
>  #ifdef CONFIG_CGROUP_HUGETLB
>         /* cgroup control files */
> -       struct cftype cgroup_files_dfl[7];
> -       struct cftype cgroup_files_legacy[9];
> +       struct cftype cgroup_files_dfl[8];
> +       struct cftype cgroup_files_legacy[10];
>  #endif
>         char name[HSTATE_NAME_LEN];
>  };
> diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
> index c137396129db..13aa0210ff70 100644
> --- a/include/linux/hugetlb_cgroup.h
> +++ b/include/linux/hugetlb_cgroup.h
> @@ -36,6 +36,11 @@ enum hugetlb_memory_event {
>         HUGETLB_NR_MEMORY_EVENTS,
>  };
>
> +struct hugetlb_cgroup_per_node {
> +       /* hugetlb usage in pages over all hstates. */
> +       atomic_long_t usage[HUGE_MAX_HSTATE];

Why do you use atomic? IIUC, 'usage' is always
increased/decreased under hugetlb_lock except
hugetlb_cgroup_read_numa_stat() which is always
reading it. So I think WRITE_ONCE/READ_ONCE
is enough.

Thanks.
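
As an illustration of the suggestion (a sketch only; not part of the v6
patch, and numa_stat_total() is a made-up helper), 'usage' would stay a
plain unsigned long that is only updated with hugetlb_lock held, while the
lockless reads in hugetlb_cgroup_read_numa_stat() get marked:

/*
 * Sketch of the READ_ONCE/WRITE_ONCE alternative: keep 'usage' a plain
 * unsigned long, update it only with hugetlb_lock held, and mark the
 * lockless reads in the stat path.
 */
struct hugetlb_cgroup_per_node {
	/* hugetlb usage in pages over all hstates, updated under hugetlb_lock. */
	unsigned long usage[HUGE_MAX_HSTATE];
};

/* Lockless reader side: sum up the non-hierarchical total for one hstate. */
static unsigned long numa_stat_total(struct hugetlb_cgroup *h_cg, int idx)
{
	unsigned long total = 0;
	int nid;

	for_each_node_state(nid, N_MEMORY)
		total += READ_ONCE(h_cg->nodeinfo[nid]->usage[idx]);

	return total;
}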

> +};
> +
>  struct hugetlb_cgroup {
>         struct cgroup_subsys_state css;
>
> @@ -57,6 +62,8 @@ struct hugetlb_cgroup {
>
>         /* Handle for "hugetlb.events.local" */
>         struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
> +
> +       struct hugetlb_cgroup_per_node *nodeinfo[];
>  };
>
>  static inline struct hugetlb_cgroup *
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index 5383023d0cca..be1f25a9ba28 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -126,29 +126,58 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
>         }
>  }
>
> +static void hugetlb_cgroup_free(struct hugetlb_cgroup *h_cgroup)
> +{
> +       int node;
> +
> +       for_each_node(node)
> +               kfree(h_cgroup->nodeinfo[node]);
> +       kfree(h_cgroup);
> +}
> +
>  static struct cgroup_subsys_state *
>  hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  {
>         struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
>         struct hugetlb_cgroup *h_cgroup;
> +       int node;
> +
> +       h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, nr_node_ids),
> +                          GFP_KERNEL);
>
> -       h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
>         if (!h_cgroup)
>                 return ERR_PTR(-ENOMEM);
>
>         if (!parent_h_cgroup)
>                 root_h_cgroup = h_cgroup;
>
> +       /*
> +        * TODO: this routine can waste much memory for nodes which will
> +        * never be onlined. It's better to use memory hotplug callback
> +        * function.
> +        */
> +       for_each_node(node) {
> +               /* Set node_to_alloc to -1 for offline nodes. */
> +               int node_to_alloc =
> +                       node_state(node, N_NORMAL_MEMORY) ? node : -1;
> +               h_cgroup->nodeinfo[node] =
> +                       kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
> +                                    GFP_KERNEL, node_to_alloc);
> +               if (!h_cgroup->nodeinfo[node])
> +                       goto fail_alloc_nodeinfo;
> +       }
> +
>         hugetlb_cgroup_init(h_cgroup, parent_h_cgroup);
>         return &h_cgroup->css;
> +
> +fail_alloc_nodeinfo:
> +       hugetlb_cgroup_free(h_cgroup);
> +       return ERR_PTR(-ENOMEM);
>  }
>
>  static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
> -       struct hugetlb_cgroup *h_cgroup;
> -
> -       h_cgroup = hugetlb_cgroup_from_css(css);
> -       kfree(h_cgroup);
> +       hugetlb_cgroup_free(hugetlb_cgroup_from_css(css));
>  }
>
>  /*
> @@ -292,7 +321,9 @@ static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
>                 return;
>
>         __set_hugetlb_cgroup(page, h_cg, rsvd);
> -       return;
> +       if (!rsvd)
> +               atomic_long_add(nr_pages,
> +                               &h_cg->nodeinfo[page_to_nid(page)]->usage[idx]);
>  }
>
>  void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
> @@ -331,7 +362,9 @@ static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
>
>         if (rsvd)
>                 css_put(&h_cg->css);
> -
> +       else
> +               atomic_long_sub(nr_pages,
> +                               &h_cg->nodeinfo[page_to_nid(page)]->usage[idx]);
>         return;

Could you remove "return" by the way?

>  }
>
> @@ -421,6 +454,61 @@ enum {
>         RES_RSVD_FAILCNT,
>  };
>
> +static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
> +{
> +       int nid;
> +       struct cftype *cft = seq_cft(seq);
> +       int idx = MEMFILE_IDX(cft->private);
> +       bool legacy = MEMFILE_ATTR(cft->private);
> +       struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
> +       struct cgroup_subsys_state *css;
> +       unsigned long usage;
> +
> +       if (legacy) {
> +               /* Add up usage across all nodes for the non-hierarchical total. */
> +               usage = 0;
> +               for_each_node_state(nid, N_MEMORY)
> +                       usage += atomic_long_read(
> +                               &h_cg->nodeinfo[nid]->usage[idx]);
> +               seq_printf(seq, "total=%lu", usage * PAGE_SIZE);
> +
> +               /* Simply print the per-node usage for the non-hierarchical total. */
> +               for_each_node_state(nid, N_MEMORY)
> +                       seq_printf(seq, " N%d=%lu", nid,
> +                                  atomic_long_read(
> +                                          &h_cg->nodeinfo[nid]->usage[idx]) *
> +                                          PAGE_SIZE);
> +               seq_putc(seq, '\n');
> +       }
> +
> +       /*
> +        * The hierarchical total is pretty much the value recorded by the
> +        * counter, so use that.
> +        */
> +       seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
> +                  page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
> +
> +       /*
> +        * For each node, transverse the css tree to obtain the hierarichal
> +        * node usage.
> +        */
> +       for_each_node_state(nid, N_MEMORY) {
> +               usage = 0;
> +               rcu_read_lock();
> +               css_for_each_descendant_pre(css, &h_cg->css) {
> +                       usage += atomic_long_read(&hugetlb_cgroup_from_css(css)
> +                                                          ->nodeinfo[nid]
> +                                                          ->usage[idx]);
> +               }
> +               rcu_read_unlock();
> +               seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
> +       }
> +
> +       seq_putc(seq, '\n');
> +
> +       return 0;
> +}
> +
>  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
>                                    struct cftype *cft)
>  {
> @@ -671,8 +759,14 @@ static void __init __hugetlb_cgroup_file_dfl_init(int idx)
>                                     events_local_file[idx]);
>         cft->flags = CFTYPE_NOT_ON_ROOT;
>
> -       /* NULL terminate the last cft */
> +       /* Add the numa stat file */
>         cft = &h->cgroup_files_dfl[6];
> +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +       cft->flags = CFTYPE_NOT_ON_ROOT;
> +
> +       /* NULL terminate the last cft */
> +       cft = &h->cgroup_files_dfl[7];
>         memset(cft, 0, sizeof(*cft));
>
>         WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
> @@ -742,8 +836,14 @@ static void __init __hugetlb_cgroup_file_legacy_init(int idx)
>         cft->write = hugetlb_cgroup_reset;
>         cft->read_u64 = hugetlb_cgroup_read_u64;
>
> +       /* Add the numa stat file */
> +       cft = &h->cgroup_files_dfl[8];
> +       snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
> +       cft->private = MEMFILE_PRIVATE(idx, 1);
> +       cft->seq_show = hugetlb_cgroup_read_numa_stat;
> +
>         /* NULL terminate the last cft */
> -       cft = &h->cgroup_files_legacy[8];
> +       cft = &h->cgroup_files_legacy[9];
>         memset(cft, 0, sizeof(*cft));
>
>         WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
> --
> 2.34.0.rc0.344.g81b53c2807-goog
Shakeel Butt Nov. 11, 2021, 2:59 a.m. UTC | #3
On Wed, Nov 10, 2021 at 6:36 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
[...]
> > +struct hugetlb_cgroup_per_node {
> > +       /* hugetlb usage in pages over all hstates. */
> > +       atomic_long_t usage[HUGE_MAX_HSTATE];
>
> Why do you use atomic? IIUC, 'usage' is always
> increased/decreased under hugetlb_lock except
> hugetlb_cgroup_read_numa_stat() which is always
> reading it. So I think WRITE_ONCE/READ_ONCE
> is enough.

Oh this is me misguiding Mina, sorry about that. Yes, READ_ONCE()
should be good enough in hugetlb_cgroup_read_numa_stat().
Mike Kravetz Nov. 12, 2021, 11:36 p.m. UTC | #4

On 11/10/21 6:36 PM, Muchun Song wrote:
> On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
>>
>> +struct hugetlb_cgroup_per_node {
>> +       /* hugetlb usage in pages over all hstates. */
>> +       atomic_long_t usage[HUGE_MAX_HSTATE];
>
> Why do you use atomic? IIUC, 'usage' is always
> increased/decreased under hugetlb_lock except
> hugetlb_cgroup_read_numa_stat() which is always
> reading it. So I think WRITE_ONCE/READ_ONCE
> is enough.

Thanks for continuing to work this, I was traveling and unable to
comment.

Unless I am missing something, I do not see a reason for WRITE_ONCE/READ_ONCE
and would suggest going back to the way this code was in v5.
Muchun Song Nov. 13, 2021, 2:44 a.m. UTC | #5
On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/10/21 6:36 PM, Muchun Song wrote:
>
> > On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
>
> >>
>
> >> +struct hugetlb_cgroup_per_node {
>
> >> +       /* hugetlb usage in pages over all hstates. */
>
> >> +       atomic_long_t usage[HUGE_MAX_HSTATE];
>
> >
>
> > Why do you use atomic? IIUC, 'usage' is always
>
> > increased/decreased under hugetlb_lock except
>
> > hugetlb_cgroup_read_numa_stat() which is always
>
> > reading it. So I think WRITE_ONCE/READ_ONCE
>
> > is enough.
>
>
>
> Thanks for continuing to work this, I was traveling and unable to
>
> comment.

Have a good time.

>
>
>
> Unless I am missing something, I do not see a reason for WRITE_ONCE/READ_ONCE

Because __hugetlb_cgroup_commit_charge and
hugetlb_cgroup_read_numa_stat can run in parallel,
which meets the definition of a data race. I believe
KCSAN could report this race. I'm not strongly
suggesting the use of WRITE/READ_ONCE() here, but
in theory it should be like this. Right?

Thanks.

>
> and would suggest going back to the way this code was in v5.
>
> --
>
> Mike Kravetz
>
Mina Almasry Nov. 13, 2021, 2:48 p.m. UTC | #6
On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 11/10/21 6:36 PM, Muchun Song wrote:
> >
> > > On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > >>
> >
> > >> +struct hugetlb_cgroup_per_node {
> >
> > >> +       /* hugetlb usage in pages over all hstates. */
> >
> > >> +       atomic_long_t usage[HUGE_MAX_HSTATE];
> >
> > >
> >
> > > Why do you use atomic? IIUC, 'usage' is always
> >
> > > increased/decreased under hugetlb_lock except
> >
> > > hugetlb_cgroup_read_numa_stat() which is always
> >
> > > reading it. So I think WRITE_ONCE/READ_ONCE
> >
> > > is enough.
> >
> >
> >
> > Thanks for continuing to work this, I was traveling and unable to
> >
> > comment.
>
> Have a good time.
>
> >
> >
> >
> > Unless I am missing something, I do not see a reason for WRITE_ONCE/READ_ONCE
>
> Because __hugetlb_cgroup_commit_charge and
> hugetlb_cgroup_read_numa_stat can run parallely,
> which meets the definition of data race. I believe
> KCSAN could report this race. I'm not strongly
> suggest using WRITE/READ_ONCE() here. But
> in theory it should be like this. Right?
>

My understanding is that the (only) potential problem here is
read_numa_stat() reading an intermediate garbage value while
commit_charge() is happening concurrently. This will only happen on
archs where the writes to an unsigned long aren't atomic. On archs
where writes to an unsigned long are atomic, there is no race, because
read_numa_stat() will only ever read the value before the concurrent
write or after the concurrent write, both of which are valid. To cater
to archs where the writes to unsigned long aren't atomic, we need to
use an atomic data type.

I'm not too familiar but my understanding from reading the
documentation is that WRITE_ONCE/READ_ONCE don't contribute anything
meaningful here:

/*
* Prevent the compiler from merging or refetching reads or writes. The
* compiler is also forbidden from reordering successive instances of
* READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some
* particular ordering. One way to make the compiler aware of ordering is to
* put the two invocations of READ_ONCE or WRITE_ONCE in different C
* statements.
...

I can't see a reason why we care about the compiler merging or
refetching reads or writes here. As far as I can tell the problem is
atomicity of the write.

> Thanks.
>
> >
> > and would suggest going back to the way this code was in v5.
> >
> > --
> >
> > Mike Kravetz
> >
Shakeel Butt Nov. 13, 2021, 7:15 p.m. UTC | #7
On Sat, Nov 13, 2021 at 6:48 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 11/10/21 6:36 PM, Muchun Song wrote:
> > >
> > > > On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
> > >
> > > >>
> > >
> > > >> +struct hugetlb_cgroup_per_node {
> > >
> > > >> +       /* hugetlb usage in pages over all hstates. */
> > >
> > > >> +       atomic_long_t usage[HUGE_MAX_HSTATE];
> > >
> > > >
> > >
> > > > Why do you use atomic? IIUC, 'usage' is always
> > >
> > > > increased/decreased under hugetlb_lock except
> > >
> > > > hugetlb_cgroup_read_numa_stat() which is always
> > >
> > > > reading it. So I think WRITE_ONCE/READ_ONCE
> > >
> > > > is enough.
> > >
> > >
> > >
> > > Thanks for continuing to work this, I was traveling and unable to
> > >
> > > comment.
> >
> > Have a good time.
> >
> > >
> > >
> > >
> > > Unless I am missing something, I do not see a reason for WRITE_ONCE/READ_ONCE
> >
> > Because __hugetlb_cgroup_commit_charge and
> > hugetlb_cgroup_read_numa_stat can run parallely,
> > which meets the definition of data race. I believe
> > KCSAN could report this race. I'm not strongly
> > suggest using WRITE/READ_ONCE() here. But
> > in theory it should be like this. Right?
> >
>
> My understanding is that the (only) potential problem here is
> read_numa_stat() reading an intermediate garbage value while
> commit_charge() is happening concurrently. This will only happen on
> archs where the writes to an unsigned long aren't atomic. On archs
> where writes to an unsigned long are atomic, there is no race, because
> read_numa_stat() will only ever read the value before the concurrent
> write or after the concurrent write, both of which are valid. To cater
> to archs where the writes to unsigned long aren't atomic, we need to
> use an atomic data type.
>
> I'm not too familiar but my understanding from reading the
> documentation is that WRITE_ONCE/READ_ONCE don't contribute anything
> meaningful here:
>
> /*
> * Prevent the compiler from merging or refetching reads or writes. The
> * compiler is also forbidden from reordering successive instances of
> * READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some
> * particular ordering. One way to make the compiler aware of ordering is to
> * put the two invocations of READ_ONCE or WRITE_ONCE in different C
> * statements.
> ...
>
> I can't see a reason why we care about the compiler merging or
> refetching reads or writes here. As far as I can tell the problem is
> atomicy of the write.
>

We have following options:

1) Use atomic type for usage.
2) Use "unsigned long" for usage along with WRITE_ONCE/READ_ONCE.
3) Use hugetlb_lock for hugetlb_cgroup_read_numa_stat as well.

All options are valid but we would like to avoid (3).

What if we use "unsigned long" type but without READ_ONCE/WRITE_ONCE.
The potential issues with that are KCSAN will report this as race and
possible garbage value on archs which do not support atomic writes to
unsigned long.

Shakeel
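
For completeness, option (3) above could look roughly like the sketch below
(node_usage_read_locked() is a made-up name; the locking is hypothetical and
not part of the series). It also shows why this is the least attractive
option: every read of the stat file would add hold time on hugetlb_lock.

/*
 * Hypothetical sketch of option (3): serialize the stat reader against the
 * writers with hugetlb_lock instead of using atomics or READ_ONCE().
 */
static unsigned long node_usage_read_locked(struct hugetlb_cgroup *h_cg,
					    int nid, int idx)
{
	unsigned long usage;

	spin_lock_irq(&hugetlb_lock);
	usage = h_cg->nodeinfo[nid]->usage[idx];
	spin_unlock_irq(&hugetlb_lock);

	return usage;
}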
Muchun Song Nov. 14, 2021, 1:43 p.m. UTC | #8
On Sun, Nov 14, 2021 at 3:15 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Sat, Nov 13, 2021 at 6:48 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
> > >
> > > On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > > >
> > > > On 11/10/21 6:36 PM, Muchun Song wrote:
> > > >
> > > > > On Thu, Nov 11, 2021 at 9:50 AM Mina Almasry <almasrymina@google.com> wrote:
> > > >
> > > > >>
> > > >
> > > > >> +struct hugetlb_cgroup_per_node {
> > > >
> > > > >> +       /* hugetlb usage in pages over all hstates. */
> > > >
> > > > >> +       atomic_long_t usage[HUGE_MAX_HSTATE];
> > > >
> > > > >
> > > >
> > > > > Why do you use atomic? IIUC, 'usage' is always
> > > >
> > > > > increased/decreased under hugetlb_lock except
> > > >
> > > > > hugetlb_cgroup_read_numa_stat() which is always
> > > >
> > > > > reading it. So I think WRITE_ONCE/READ_ONCE
> > > >
> > > > > is enough.
> > > >
> > > >
> > > >
> > > > Thanks for continuing to work this, I was traveling and unable to
> > > >
> > > > comment.
> > >
> > > Have a good time.
> > >
> > > >
> > > >
> > > >
> > > > Unless I am missing something, I do not see a reason for WRITE_ONCE/READ_ONCE
> > >
> > > Because __hugetlb_cgroup_commit_charge and
> > > hugetlb_cgroup_read_numa_stat can run parallely,
> > > which meets the definition of data race. I believe
> > > KCSAN could report this race. I'm not strongly
> > > suggest using WRITE/READ_ONCE() here. But
> > > in theory it should be like this. Right?
> > >
> >
> > My understanding is that the (only) potential problem here is
> > read_numa_stat() reading an intermediate garbage value while
> > commit_charge() is happening concurrently. This will only happen on
> > archs where the writes to an unsigned long aren't atomic. On archs
> > where writes to an unsigned long are atomic, there is no race, because
> > read_numa_stat() will only ever read the value before the concurrent
> > write or after the concurrent write, both of which are valid. To cater
> > to archs where the writes to unsigned long aren't atomic, we need to
> > use an atomic data type.
> >
> > I'm not too familiar but my understanding from reading the
> > documentation is that WRITE_ONCE/READ_ONCE don't contribute anything
> > meaningful here:
> >
> > /*
> > * Prevent the compiler from merging or refetching reads or writes. The
> > * compiler is also forbidden from reordering successive instances of
> > * READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some
> > * particular ordering. One way to make the compiler aware of ordering is to
> > * put the two invocations of READ_ONCE or WRITE_ONCE in different C
> > * statements.
> > ...
> >
> > I can't see a reason why we care about the compiler merging or
> > refetching reads or writes here. As far as I can tell the problem is
> > atomicy of the write.
> >
>
> We have following options:
>
> 1) Use atomic type for usage.
> 2) Use "unsigned long" for usage along with WRITE_ONCE/READ_ONCE.
> 3) Use hugetlb_lock for hugetlb_cgroup_read_numa_stat as well.
>
> All options are valid but we would like to avoid (3).
>
> What if we use "unsigned long" type but without READ_ONCE/WRITE_ONCE.
> The potential issues with that are KCSAN will report this as race and
> possible garbage value on archs which do not support atomic writes to
> unsigned long.

At least I totally agree with you. Thanks for your detailed explanation.

>
> Shakeel
Mike Kravetz Nov. 15, 2021, 6:22 p.m. UTC | #9
On 11/14/21 5:43 AM, Muchun Song wrote:
> On Sun, Nov 14, 2021 at 3:15 AM Shakeel Butt <shakeelb@google.com> wrote:
>> On Sat, Nov 13, 2021 at 6:48 AM Mina Almasry <almasrymina@google.com> wrote:
>>> On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
>>>> On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> We have following options:
>>
>> 1) Use atomic type for usage.
>> 2) Use "unsigned long" for usage along with WRITE_ONCE/READ_ONCE.
>> 3) Use hugetlb_lock for hugetlb_cgroup_read_numa_stat as well.
>>
>> All options are valid but we would like to avoid (3).
>>
>> What if we use "unsigned long" type but without READ_ONCE/WRITE_ONCE.
>> The potential issues with that are KCSAN will report this as race and
>> possible garbage value on archs which do not support atomic writes to
>> unsigned long.
>
> At least I totally agree with you. Thanks for your detailed explanation.
>

Thanks everyone.  This makes sense.

However, I should note that this same situation (updates to unsigned
long variables under lock and reads of the same variable without
lock or READ/WRITE_ONCE) exists in hugetlb sysfs files today.  Not
suggesting that this makes it OK to ignore the potential issue.  Just
wanted to point this out.
Mina Almasry Nov. 15, 2021, 6:55 p.m. UTC | #10
On Mon, Nov 15, 2021 at 10:22 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/14/21 5:43 AM, Muchun Song wrote:
>
> > On Sun, Nov 14, 2021 at 3:15 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> >> On Sat, Nov 13, 2021 at 6:48 AM Mina Almasry <almasrymina@google.com> wrote:
>
> >>> On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> >>>> On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> >> We have following options:
>
> >>
>
> >> 1) Use atomic type for usage.
>
> >> 2) Use "unsigned long" for usage along with WRITE_ONCE/READ_ONCE.
>
> >> 3) Use hugetlb_lock for hugetlb_cgroup_read_numa_stat as well.
>
> >>
>
> >> All options are valid but we would like to avoid (3).
>
> >>
>
> >> What if we use "unsigned long" type but without READ_ONCE/WRITE_ONCE.
>
> >> The potential issues with that are KCSAN will report this as race and
>
> >> possible garbage value on archs which do not support atomic writes to
>
> >> unsigned long.
>
> >
>
> > At least I totally agree with you. Thanks for your detailed explanation.
>
> >
>
>
>
> Thanks everyone.  This makes sense.
>
>
>
> However, I should note that this same situation (updates to unsigned
>
> long variables under lock and reads of the the same variable without
>
> lock or READ/WRITE_ONCE) exists in hugetlb sysfs files today.  Not
>
> suggesting that this makes it OK to ignore the potential issue.  Just
>
> wanted to point this out.
>

Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
to provide atomicity to the write or read, just prevents the compiler
from re-ordering them. Is there something I'm missing, or is the
suggestion to add READ_ONCE/WRITE_ONCE simply to suppress the KCSAN
warnings?

> --
>
> Mike Kravetz
>
Shakeel Butt Nov. 15, 2021, 7:59 p.m. UTC | #11

On Mon, Nov 15, 2021 at 10:55 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Mon, Nov 15, 2021 at 10:22 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 11/14/21 5:43 AM, Muchun Song wrote:
> >
> > > On Sun, Nov 14, 2021 at 3:15 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > >> On Sat, Nov 13, 2021 at 6:48 AM Mina Almasry <almasrymina@google.com> wrote:
> >
> > >>> On Fri, Nov 12, 2021 at 6:45 PM Muchun Song <songmuchun@bytedance.com> wrote:
> >
> > >>>> On Sat, Nov 13, 2021 at 7:36 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > >> We have following options:
> >
> > >>
> >
> > >> 1) Use atomic type for usage.
> >
> > >> 2) Use "unsigned long" for usage along with WRITE_ONCE/READ_ONCE.
> >
> > >> 3) Use hugetlb_lock for hugetlb_cgroup_read_numa_stat as well.
> >
> > >>
> >
> > >> All options are valid but we would like to avoid (3).
> >
> > >>
> >
> > >> What if we use "unsigned long" type but without READ_ONCE/WRITE_ONCE.
> >
> > >> The potential issues with that are KCSAN will report this as race and
> >
> > >> possible garbage value on archs which do not support atomic writes to
> >
> > >> unsigned long.
> >
> > >
> >
> > > At least I totally agree with you. Thanks for your detailed explanation.
> >
> > >
> >
> >
> >
> > Thanks everyone.  This makes sense.
> >
> >
> >
> > However, I should note that this same situation (updates to unsigned
> >
> > long variables under lock and reads of the the same variable without
> >
> > lock or READ/WRITE_ONCE) exists in hugetlb sysfs files today.  Not
> >
> > suggesting that this makes it OK to ignore the potential issue.  Just
> >
> > wanted to point this out.
> >
>
> Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
> to provide atomicity to the write or read, just prevents the compiler
> from re-ordering them. Is there something I'm missing, or is the
> suggestion to add READ_ONCE/WRITE_ONCE simply to supress the KCSAN
> warnings?
>

+Paul & Marco

Let's ask the experts.

We have a "unsigned long usage" variable that is updated within a lock
(hugetlb_lock) but is read without the lock.

Q1) I think KCSAN will complain about it and READ_ONCE() in the
unlocked read path should be good enough to silent KCSAN. So, the
question is should we still use WRITE_ONCE() as well for usage within
hugetlb_lock?

Q2) Second question is more about 64 bit archs breaking a 64 bit write
into two 32 bit writes. Is this a real issue? If yes, then the
combination of READ_ONCE()/WRITE_ONCE() are good enough for the given
use-case?

thanks,
Shakeel
Marco Elver Nov. 16, 2021, 12:04 p.m. UTC | #12
On Mon, Nov 15, 2021 at 11:59AM -0800, Shakeel Butt wrote:
> On Mon, Nov 15, 2021 at 10:55 AM Mina Almasry <almasrymina@google.com> wrote:
[...]
> > Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
> > to provide atomicity to the write or read, just prevents the compiler
> > from re-ordering them. Is there something I'm missing, or is the
> > suggestion to add READ_ONCE/WRITE_ONCE simply to supress the KCSAN
> > warnings?

It's actually the opposite: READ_ONCE/WRITE_ONCE provide very little
ordering (modulo dependencies) guarantees, which includes ordering by
compiler, but are supposed to provide atomicity (when used with properly
aligned types up to word size [1]; see __READ_ONCE for non-atomic
variant).

Some more background...

The warnings that KCSAN tells you about are "data races", which occur
when you have conflicting concurrent accesses, one of which is "plain"
and at least one write. I think [2] provides a reasonable summary of
data races and why we should care.

For Linux, our own memory model (LKMM) documents this [3], and says that
as long as concurrent operations are marked (non-plain; e.g. *ONCE),
there won't be any data races.

There are multiple reasons why data races are undesirable, one of which
is to avoid bad compiler transformations [4], because compilers are
oblivious to concurrency otherwise.

Why do marked operations avoid data races and prevent miscompiles?
Among other things, because they should be executed atomically. If they
weren't a lot of code would be buggy (there had been cases where the old
READ_ONCE could be used on data larger than word size, which certainly
weren't atomic, but this is no longer possible).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/rwonce.h#n35
[2] https://lwn.net/Articles/816850/#Why%20should%20we%20care%20about%20data%20races?
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1920
[4] https://lwn.net/Articles/793253/

Some rules of thumb when to use which marking:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt

In an ideal world, we'd have all intentionally concurrent accesses
marked. As-is, KCSAN will find:

A. Data race, where failure due to current compilers is unlikely
   (supposedly "benign"); merely marking the accesses appropriately is
   sufficient. Finding a crash for these will require a miscompilation,
   but otherwise look "benign" at the C-language level.

B. Race-condition bugs where the bug manifests as a data race, too --
   simply marking things doesn't fix the problem. These are the types of
   bugs where a data race would point out a more severe issue.

Right now we have way too much of type (A), which means looking for (B)
requires patience.
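
To make the (A)/(B) distinction concrete, a generic sketch (names made up,
not taken from the hugetlb code) might be:

/* (A) "Benign" data race: a stats counter written under some lock and read
 * locklessly; marking the accesses is sufficient.
 */
unsigned long stat_counter;	/* updated under some_lock */

unsigned long stat_read(void)
{
	return READ_ONCE(stat_counter);	/* lockless reader */
}

/* (B) Race condition that also manifests as a data race: marking silences
 * the report, but the check and the update still need to happen under one
 * lock to be correct.
 */
bool resource_busy;

void try_take_resource(void)
{
	if (!READ_ONCE(resource_busy))		/* check */
		WRITE_ONCE(resource_busy, true);	/* act: still a TOCTOU race */
}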

> +Paul & Marco
> 
> Let's ask the experts.
> 
> We have a "unsigned long usage" variable that is updated within a lock
> (hugetlb_lock) but is read without the lock.
>
> Q1) I think KCSAN will complain about it and READ_ONCE() in the
> unlocked read path should be good enough to silent KCSAN. So, the
> question is should we still use WRITE_ONCE() as well for usage within
> hugetlb_lock?

KCSAN's default config will forgive the lack of WRITE_ONCE().
Technically it's still a data race (which KCSAN can find with a config
change), but can be forgiven because compilers are less likely to cause
trouble for writes (background: https://lwn.net/Articles/816854/ bit
about "Unmarked writes (aligned and up to word size)...").

I would mark both if feasible, as it clearly documents the fact the
write can be read concurrently.

> Q2) Second question is more about 64 bit archs breaking a 64 bit write
> into two 32 bit writes. Is this a real issue? If yes, then the
> combination of READ_ONCE()/WRITE_ONCE() are good enough for the given
> use-case?

Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
and at least relieve you to not worry about it (and shift the burden to
WRITE_ONCE's implementation).

Thanks,
-- Marco
Mina Almasry Nov. 16, 2021, 8:48 p.m. UTC | #13
On Tue, Nov 16, 2021 at 4:04 AM Marco Elver <elver@google.com> wrote:
>
> On Mon, Nov 15, 2021 at 11:59AM -0800, Shakeel Butt wrote:
> > On Mon, Nov 15, 2021 at 10:55 AM Mina Almasry <almasrymina@google.com> wrote:
> [...]
> > > Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
> > > to provide atomicity to the write or read, just prevents the compiler
> > > from re-ordering them. Is there something I'm missing, or is the
> > > suggestion to add READ_ONCE/WRITE_ONCE simply to supress the KCSAN
> > > warnings?
>
> It's actually the opposite: READ_ONCE/WRITE_ONCE provide very little
> ordering (modulo dependencies) guarantees, which includes ordering by
> compiler, but are supposed to provide atomicity (when used with properly
> aligned types up to word size [1]; see __READ_ONCE for non-atomic
> variant).
>
> Some more background...
>
> The warnings that KCSAN tells you about are "data races", which occur
> when you have conflicting concurrent accesses, one of which is "plain"
> and at least one write. I think [2] provides a reasonable summary of
> data races and why we should care.
>
> For Linux, our own memory model (LKMM) documents this [3], and says that
> as long as concurrent operations are marked (non-plain; e.g. *ONCE),
> there won't be any data races.
>
> There are multiple reasons why data races are undesirable, one of which
> is to avoid bad compiler transformations [4], because compilers are
> oblivious to concurrency otherwise.
>
> Why do marked operations avoid data races and prevent miscompiles?
> Among other things, because they should be executed atomically. If they
> weren't a lot of code would be buggy (there had been cases where the old
> READ_ONCE could be used on data larger than word size, which certainly
> weren't atomic, but this is no longer possible).
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/rwonce.h#n35
> [2] https://lwn.net/Articles/816850/#Why%20should%20we%20care%20about%20data%20races?
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1920
> [4] https://lwn.net/Articles/793253/
>
> Some rules of thumb when to use which marking:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
>
> In an ideal world, we'd have all intentionally concurrent accesses
> marked. As-is, KCSAN will find:
>
> A. Data race, where failure due to current compilers is unlikely
>    (supposedly "benign"); merely marking the accesses appropriately is
>    sufficient. Finding a crash for these will require a miscompilation,
>    but otherwise look "benign" at the C-language level.
>
> B. Race-condition bugs where the bug manifests as a data race, too --
>    simply marking things doesn't fix the problem. These are the types of
>    bugs where a data race would point out a more severe issue.
>
> Right now we have way too much of type (A), which means looking for (B)
> requires patience.
>
> > +Paul & Marco
> >
> > Let's ask the experts.
> >
> > We have a "unsigned long usage" variable that is updated within a lock
> > (hugetlb_lock) but is read without the lock.
> >
> > Q1) I think KCSAN will complain about it and READ_ONCE() in the
> > unlocked read path should be good enough to silent KCSAN. So, the
> > question is should we still use WRITE_ONCE() as well for usage within
> > hugetlb_lock?
>
> KCSAN's default config will forgive the lack of WRITE_ONCE().
> Technically it's still a data race (which KCSAN can find with a config
> change), but can be forgiven because compilers are less likely to cause
> trouble for writes (background: https://lwn.net/Articles/816854/ bit
> about "Unmarked writes (aligned and up to word size)...").
>
> I would mark both if feasible, as it clearly documents the fact the
> write can be read concurrently.
>
> > Q2) Second question is more about 64 bit archs breaking a 64 bit write
> > into two 32 bit writes. Is this a real issue? If yes, then the
> > combination of READ_ONCE()/WRITE_ONCE() are good enough for the given
> > use-case?
>
> Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> and at least relieve you to not worry about it (and shift the burden to
> WRITE_ONCE's implementation).
>

Thank you very much for the detailed response. I can add READ_ONCE()
at the no-lock read site, that is no issue.

However, for the writes that happen while holding the lock, the write
is like so:
+               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;

And like so:
+               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;

I.e. they are increments/decrements. Sorry if I missed it but I can't
find an INC_ONCE(), and it seems wrong to me to do something like:

+               WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
+                          h_cg->nodeinfo[page_to_nid(page)]->usage[idx] + nr_pages);

I know we're holding a lock anyway so there is no race, but to the
casual reader this looks wrong as there is a race between the fetch of
the value and the WRITE_ONCE(). What to do here? Seems to me the most
reasonable thing to do is just READ_ONCE() and leave the write plain?


> Thanks,
> -- Marco
Shakeel Butt Nov. 16, 2021, 8:59 p.m. UTC | #14
On Tue, Nov 16, 2021 at 12:48 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 4:04 AM Marco Elver <elver@google.com> wrote:
> >
> > On Mon, Nov 15, 2021 at 11:59AM -0800, Shakeel Butt wrote:
> > > On Mon, Nov 15, 2021 at 10:55 AM Mina Almasry <almasrymina@google.com> wrote:
> > [...]
> > > > Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
> > > > to provide atomicity to the write or read, just prevents the compiler
> > > > from re-ordering them. Is there something I'm missing, or is the
> > > > suggestion to add READ_ONCE/WRITE_ONCE simply to supress the KCSAN
> > > > warnings?
> >
> > It's actually the opposite: READ_ONCE/WRITE_ONCE provide very little
> > ordering (modulo dependencies) guarantees, which includes ordering by
> > compiler, but are supposed to provide atomicity (when used with properly
> > aligned types up to word size [1]; see __READ_ONCE for non-atomic
> > variant).
> >
> > Some more background...
> >
> > The warnings that KCSAN tells you about are "data races", which occur
> > when you have conflicting concurrent accesses, one of which is "plain"
> > and at least one write. I think [2] provides a reasonable summary of
> > data races and why we should care.
> >
> > For Linux, our own memory model (LKMM) documents this [3], and says that
> > as long as concurrent operations are marked (non-plain; e.g. *ONCE),
> > there won't be any data races.
> >
> > There are multiple reasons why data races are undesirable, one of which
> > is to avoid bad compiler transformations [4], because compilers are
> > oblivious to concurrency otherwise.
> >
> > Why do marked operations avoid data races and prevent miscompiles?
> > Among other things, because they should be executed atomically. If they
> > weren't a lot of code would be buggy (there had been cases where the old
> > READ_ONCE could be used on data larger than word size, which certainly
> > weren't atomic, but this is no longer possible).
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/rwonce.h#n35
> > [2] https://lwn.net/Articles/816850/#Why%20should%20we%20care%20about%20data%20races?
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1920
> > [4] https://lwn.net/Articles/793253/
> >
> > Some rules of thumb when to use which marking:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
> >
> > In an ideal world, we'd have all intentionally concurrent accesses
> > marked. As-is, KCSAN will find:
> >
> > A. Data race, where failure due to current compilers is unlikely
> >    (supposedly "benign"); merely marking the accesses appropriately is
> >    sufficient. Finding a crash for these will require a miscompilation,
> >    but otherwise look "benign" at the C-language level.
> >
> > B. Race-condition bugs where the bug manifests as a data race, too --
> >    simply marking things doesn't fix the problem. These are the types of
> >    bugs where a data race would point out a more severe issue.
> >
> > Right now we have way too much of type (A), which means looking for (B)
> > requires patience.
> >
> > > +Paul & Marco
> > >
> > > Let's ask the experts.
> > >
> > > We have a "unsigned long usage" variable that is updated within a lock
> > > (hugetlb_lock) but is read without the lock.
> > >
> > > Q1) I think KCSAN will complain about it and READ_ONCE() in the
> > > unlocked read path should be good enough to silent KCSAN. So, the
> > > question is should we still use WRITE_ONCE() as well for usage within
> > > hugetlb_lock?
> >
> > KCSAN's default config will forgive the lack of WRITE_ONCE().
> > Technically it's still a data race (which KCSAN can find with a config
> > change), but can be forgiven because compilers are less likely to cause
> > trouble for writes (background: https://lwn.net/Articles/816854/ bit
> > about "Unmarked writes (aligned and up to word size)...").
> >
> > I would mark both if feasible, as it clearly documents the fact the
> > write can be read concurrently.
> >
> > > Q2) Second question is more about 64 bit archs breaking a 64 bit write
> > > into two 32 bit writes. Is this a real issue? If yes, then the
> > > combination of READ_ONCE()/WRITE_ONCE() are good enough for the given
> > > use-case?
> >
> > Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> > and at least relieve you to not worry about it (and shift the burden to
> > WRITE_ONCE's implementation).
> >
>
> Thank you very much for the detailed response. I can add READ_ONCE()
> at the no-lock read site, that is no issue.
>
> However, for the writes that happen while holding the lock, the write
> is like so:
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
>
> And like so:
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
>
> I.e. they are increments/decrements. Sorry if I missed it but I can't
> find an INC_ONCE(), and it seems wrong to me to do something like:
>
> +               WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
> +
> h_cg->nodeinfo[page_to_nid(page)] + nr_pages);
>
> I know we're holding a lock anyway so there is no race, but to the
> casual reader this looks wrong as there is a race between the fetch of
> the value and the WRITE_ONCE(). What to do here? Seems to me the most
> reasonable thing to do is just READ_ONCE() and leave the write plain?
>
>

How about atomic_long_t?
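
Roughly (just a sketch, assuming the per-node usage array becomes atomic_long_t):

  /* update, still done under hugetlb_lock */
  atomic_long_add(nr_pages, &h_cg->nodeinfo[page_to_nid(page)]->usage[idx]);

  /* lockless read from the stat path */
  usage = atomic_long_read(&h_cg->nodeinfo[nid]->usage[idx]);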
Marco Elver Nov. 16, 2021, 9:47 p.m. UTC | #15
On Tue, Nov 16, 2021 at 12:59PM -0800, Shakeel Butt wrote:
> On Tue, Nov 16, 2021 at 12:48 PM Mina Almasry <almasrymina@google.com> wrote:
[...]
> > > Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> > > and at least relieve you to not worry about it (and shift the burden to
> > > WRITE_ONCE's implementation).
> > >
> >
> > Thank you very much for the detailed response. I can add READ_ONCE()
> > at the no-lock read site, that is no issue.
> >
> > However, for the writes that happen while holding the lock, the write
> > is like so:
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
> >
> > And like so:
> > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
> >
> > I.e. they are increments/decrements. Sorry if I missed it but I can't
> > find an INC_ONCE(), and it seems wrong to me to do something like:
> >
> > +               WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
> > +
> > h_cg->nodeinfo[page_to_nid(page)] + nr_pages);

From what I gather there are no concurrent writers, right?

WRITE_ONCE(a, a + X) is perfectly fine. What it says is that you can
have concurrent readers here, but no concurrent writers (and KCSAN will
still check that). Maybe we need a more convenient macro for this idiom
one day.

Though I think for something like

	h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;

it seems there wants to be a temporary long* so that you could write
WRITE_ONCE(*usage, *usage + nr_pages) or something.
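
E.g., a minimal untested sketch of that idea:

	unsigned long *usage = &h_cg->nodeinfo[page_to_nid(page)]->usage[idx];

	/* hugetlb_lock serializes writers; lockless readers pair with READ_ONCE() */
	WRITE_ONCE(*usage, *usage + nr_pages);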

> > I know we're holding a lock anyway so there is no race, but to the
> > casual reader this looks wrong as there is a race between the fetch of
> > the value and the WRITE_ONCE(). What to do here? Seems to me the most
> > reasonable thing to do is just READ_ONCE() and leave the write plain?
> >
> >
> 
> How about atomic_long_t?

That would work of course; for very hot path code, though, it might be
excessive given you don't have concurrent writers.

Looking at the patch in more detail, the counter is a stat counter that
can be read from a stat file, correct? Hypothetically, what would happen
if the reader of 'usage' reads approximate values?

If the answer is "nothing", then this could classify as an entirely
"benign" data race and you could only use data_race() on the reader and
leave the writers unmarked using normal +=/-=. Check if it fits
"Data-Racy Reads for Approximate Diagnostics" [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt#n74
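
In that variant only the stat-file reader would change; a sketch (the
writers stay plain +=/-= under hugetlb_lock):

	usage += data_race(h_cg->nodeinfo[nid]->usage[idx]);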
Mina Almasry Nov. 16, 2021, 9:53 p.m. UTC | #16
On Tue, Nov 16, 2021 at 1:48 PM Marco Elver <elver@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 12:59PM -0800, Shakeel Butt wrote:
> > On Tue, Nov 16, 2021 at 12:48 PM Mina Almasry <almasrymina@google.com> wrote:
> [...]
> > > > Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> > > > and at least relieve you to not worry about it (and shift the burden to
> > > > WRITE_ONCE's implementation).
> > > >
> > >
> > > Thank you very much for the detailed response. I can add READ_ONCE()
> > > at the no-lock read site, that is no issue.
> > >
> > > However, for the writes that happen while holding the lock, the write
> > > is like so:
> > > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
> > >
> > > And like so:
> > > +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
> > >
> > > I.e. they are increments/decrements. Sorry if I missed it but I can't
> > > find an INC_ONCE(), and it seems wrong to me to do something like:
> > >
> > > +               WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
> > > +
> > > h_cg->nodeinfo[page_to_nid(page)] + nr_pages);
>
> From what I gather there are no concurrent writers, right?
>
> WRITE_ONCE(a, a + X) is perfectly fine. What it says is that you can
> have concurrent readers here, but no concurrent writers (and KCSAN will
> still check that). Maybe we need a more convenient macro for this idiom
> one day..
>
> Though I think for something like
>
>         h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
>
> it seems there wants to be an temporary long* so that you could write
> WRITE_ONCE(*usage, *usage + nr_pages) or something.
>

Ah, perfect, OK I can do this, and maybe add a comment explaining that
we don't have concurrent writers.
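
Something along these lines, perhaps:

  /*
   * Updated only under hugetlb_lock, so there are no concurrent writers;
   * read locklessly (with READ_ONCE()) from the numa_stat read path.
   */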

> > > I know we're holding a lock anyway so there is no race, but to the
> > > casual reader this looks wrong as there is a race between the fetch of
> > > the value and the WRITE_ONCE(). What to do here? Seems to me the most
> > > reasonable thing to do is just READ_ONCE() and leave the write plain?
> > >
> > >
> >
> > How about atomic_long_t?
>
> That would work of course; if this is very hot path code it might be
> excessive if you don't have concurrent writers.
>
> Looking at the patch in more detail, the counter is a stat counter that
> can be read from a stat file, correct? Hypothetically, what would happen
> if the reader of 'usage' reads approximate values?
>
> If the answer is "nothing", then this could classify as an entirely
> "benign" data race and you could only use data_race() on the reader and
> leave the writers unmarked using normal +=/-=. Check if it fits
> "Data-Racy Reads for Approximate Diagnostics" [1].
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt#n74

Thank you very much for your quick responses. I think if the usage
returns a garbage/approximate value once in a while, people will notice,
and I can see it causing issues. I think it's worth doing it
'properly' here. I'll upload another version with these changes.
Muchun Song Nov. 17, 2021, 5:54 a.m. UTC | #17
On Wed, Nov 17, 2021 at 4:48 AM Mina Almasry <almasrymina@google.com> wrote:
>
> On Tue, Nov 16, 2021 at 4:04 AM Marco Elver <elver@google.com> wrote:
> >
> > On Mon, Nov 15, 2021 at 11:59AM -0800, Shakeel Butt wrote:
> > > On Mon, Nov 15, 2021 at 10:55 AM Mina Almasry <almasrymina@google.com> wrote:
> > [...]
> > > > Sorry I'm still a bit confused. READ_ONCE/WRITE_ONCE isn't documented
> > > > to provide atomicity to the write or read, just prevents the compiler
> > > > from re-ordering them. Is there something I'm missing, or is the
> > > > suggestion to add READ_ONCE/WRITE_ONCE simply to supress the KCSAN
> > > > warnings?
> >
> > It's actually the opposite: READ_ONCE/WRITE_ONCE provide very little
> > ordering (modulo dependencies) guarantees, which includes ordering by
> > compiler, but are supposed to provide atomicity (when used with properly
> > aligned types up to word size [1]; see __READ_ONCE for non-atomic
> > variant).
> >
> > Some more background...
> >
> > The warnings that KCSAN tells you about are "data races", which occur
> > when you have conflicting concurrent accesses, one of which is "plain"
> > and at least one write. I think [2] provides a reasonable summary of
> > data races and why we should care.
> >
> > For Linux, our own memory model (LKMM) documents this [3], and says that
> > as long as concurrent operations are marked (non-plain; e.g. *ONCE),
> > there won't be any data races.
> >
> > There are multiple reasons why data races are undesirable, one of which
> > is to avoid bad compiler transformations [4], because compilers are
> > oblivious to concurrency otherwise.
> >
> > Why do marked operations avoid data races and prevent miscompiles?
> > Among other things, because they should be executed atomically. If they
> > weren't a lot of code would be buggy (there had been cases where the old
> > READ_ONCE could be used on data larger than word size, which certainly
> > weren't atomic, but this is no longer possible).
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/asm-generic/rwonce.h#n35
> > [2] https://lwn.net/Articles/816850/#Why%20should%20we%20care%20about%20data%20races?
> > [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1920
> > [4] https://lwn.net/Articles/793253/
> >
> > Some rules of thumb when to use which marking:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt
> >
> > In an ideal world, we'd have all intentionally concurrent accesses
> > marked. As-is, KCSAN will find:
> >
> > A. Data race, where failure due to current compilers is unlikely
> >    (supposedly "benign"); merely marking the accesses appropriately is
> >    sufficient. Finding a crash for these will require a miscompilation,
> >    but otherwise look "benign" at the C-language level.
> >
> > B. Race-condition bugs where the bug manifests as a data race, too --
> >    simply marking things doesn't fix the problem. These are the types of
> >    bugs where a data race would point out a more severe issue.
> >
> > Right now we have way too much of type (A), which means looking for (B)
> > requires patience.
> >
> > > +Paul & Marco
> > >
> > > Let's ask the experts.
> > >
> > > We have a "unsigned long usage" variable that is updated within a lock
> > > (hugetlb_lock) but is read without the lock.
> > >
> > > Q1) I think KCSAN will complain about it and READ_ONCE() in the
> > > unlocked read path should be good enough to silent KCSAN. So, the
> > > question is should we still use WRITE_ONCE() as well for usage within
> > > hugetlb_lock?
> >
> > KCSAN's default config will forgive the lack of WRITE_ONCE().
> > Technically it's still a data race (which KCSAN can find with a config
> > change), but can be forgiven because compilers are less likely to cause
> > trouble for writes (background: https://lwn.net/Articles/816854/ bit
> > about "Unmarked writes (aligned and up to word size)...").
> >
> > I would mark both if feasible, as it clearly documents the fact the
> > write can be read concurrently.
> >
> > > Q2) Second question is more about 64 bit archs breaking a 64 bit write
> > > into two 32 bit writes. Is this a real issue? If yes, then the
> > > combination of READ_ONCE()/WRITE_ONCE() are good enough for the given
> > > use-case?
> >
> > Per above, probably unlikely, but allowed. WRITE_ONCE should prevent it,
> > and at least relieve you to not worry about it (and shift the burden to
> > WRITE_ONCE's implementation).
> >
>
> Thank you very much for the detailed response. I can add READ_ONCE()
> at the no-lock read site, that is no issue.
>
> However, for the writes that happen while holding the lock, the write
> is like so:
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] += nr_pages;
>
> And like so:
> +               h_cg->nodeinfo[page_to_nid(page)]->usage[idx] -= nr_pages;
>
> I.e. they are increments/decrements. Sorry if I missed it but I can't
> find an INC_ONCE(), and it seems wrong to me to do something like:
>
> +               WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx],
> +
> h_cg->nodeinfo[page_to_nid(page)] + nr_pages);

How about using a local variable to cache
h_cg->nodeinfo[page_to_nid(page)]->usage[idx], like the following?

long usage = h_cg->nodeinfo[page_to_nid(page)]->usage[idx];

usage += nr_pages;
WRITE_ONCE(h_cg->nodeinfo[page_to_nid(page)]->usage[idx], usage);

Does this look more comfortable?

>
> I know we're holding a lock anyway so there is no race, but to the
> casual reader this looks wrong as there is a race between the fetch of
> the value and the WRITE_ONCE(). What to do here? Seems to me the most

It's not an issue: the fetch is a read, and the only concurrent access
(the lockless read of the stat file) is also a read. Two plain reads
cannot form a data race, since a data race needs at least one write.

> reasonable thing to do is just READ_ONCE() and leave the write plain?

I suggest using WRITE_ONCE() here and READ_ONCE() on the reader side.
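
i.e. the lockless read in hugetlb_cgroup_read_numa_stat() would then be
something like:

  usage = READ_ONCE(h_cg->nodeinfo[nid]->usage[idx]);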

Thanks.

>
>
> > Thanks,
> > -- Marco

Patch

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 338f2c7d7a1c..0fa724d82abb 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -29,12 +29,14 @@  Brief summary of control files::
  hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb  usage recorded
  hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
  hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit
+ hugetlb.<hugepagesize>.numa_stat                      # show the numa information of the hugetlb memory charged to this cgroup

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::

   hugetlb.1GB.limit_in_bytes
   hugetlb.1GB.max_usage_in_bytes
+  hugetlb.1GB.numa_stat
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
   hugetlb.1GB.rsvd.limit_in_bytes
@@ -43,6 +45,7 @@  files include::
   hugetlb.1GB.rsvd.failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
+  hugetlb.64KB.numa_stat
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
   hugetlb.64KB.rsvd.limit_in_bytes
@@ -51,6 +54,7 @@  files include::
   hugetlb.64KB.rsvd.failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
+  hugetlb.32MB.numa_stat
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
   hugetlb.32MB.rsvd.limit_in_bytes
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 4d8c27eca96b..356847f8f008 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2252,6 +2252,11 @@  HugeTLB Interface Files
 	are local to the cgroup i.e. not hierarchical. The file modified event
 	generated on this file reflects only the local events.

+  hugetlb.<hugepagesize>.numa_stat
+	Similar to memory.numa_stat, it shows the numa information of the
+        hugetlb pages of <hugepagesize> in this cgroup.  Only active,
+        in-use hugetlb pages are included.  The per-node values are in bytes.
+
 Misc
 ----

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 1faebe1cd0ed..0445faaa636e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -613,8 +613,8 @@  struct hstate {
 #endif
 #ifdef CONFIG_CGROUP_HUGETLB
 	/* cgroup control files */
-	struct cftype cgroup_files_dfl[7];
-	struct cftype cgroup_files_legacy[9];
+	struct cftype cgroup_files_dfl[8];
+	struct cftype cgroup_files_legacy[10];
 #endif
 	char name[HSTATE_NAME_LEN];
 };
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c137396129db..13aa0210ff70 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -36,6 +36,11 @@  enum hugetlb_memory_event {
 	HUGETLB_NR_MEMORY_EVENTS,
 };

+struct hugetlb_cgroup_per_node {
+	/* hugetlb usage in pages over all hstates. */
+	atomic_long_t usage[HUGE_MAX_HSTATE];
+};
+
 struct hugetlb_cgroup {
 	struct cgroup_subsys_state css;

@@ -57,6 +62,8 @@  struct hugetlb_cgroup {

 	/* Handle for "hugetlb.events.local" */
 	struct cgroup_file events_local_file[HUGE_MAX_HSTATE];
+
+	struct hugetlb_cgroup_per_node *nodeinfo[];
 };

 static inline struct hugetlb_cgroup *
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 5383023d0cca..be1f25a9ba28 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -126,29 +126,58 @@  static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
 	}
 }

+static void hugetlb_cgroup_free(struct hugetlb_cgroup *h_cgroup)
+{
+	int node;
+
+	for_each_node(node)
+		kfree(h_cgroup->nodeinfo[node]);
+	kfree(h_cgroup);
+}
+
 static struct cgroup_subsys_state *
 hugetlb_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
 	struct hugetlb_cgroup *parent_h_cgroup = hugetlb_cgroup_from_css(parent_css);
 	struct hugetlb_cgroup *h_cgroup;
+	int node;
+
+	h_cgroup = kzalloc(struct_size(h_cgroup, nodeinfo, nr_node_ids),
+			   GFP_KERNEL);

-	h_cgroup = kzalloc(sizeof(*h_cgroup), GFP_KERNEL);
 	if (!h_cgroup)
 		return ERR_PTR(-ENOMEM);

 	if (!parent_h_cgroup)
 		root_h_cgroup = h_cgroup;

+	/*
+	 * TODO: this routine can waste a lot of memory for nodes which will
+	 * never be onlined. It would be better to allocate this memory
+	 * from a memory hotplug callback instead.
+	 */
+	for_each_node(node) {
+		/* Set node_to_alloc to -1 for offline nodes. */
+		int node_to_alloc =
+			node_state(node, N_NORMAL_MEMORY) ? node : -1;
+		h_cgroup->nodeinfo[node] =
+			kzalloc_node(sizeof(struct hugetlb_cgroup_per_node),
+				     GFP_KERNEL, node_to_alloc);
+		if (!h_cgroup->nodeinfo[node])
+			goto fail_alloc_nodeinfo;
+	}
+
 	hugetlb_cgroup_init(h_cgroup, parent_h_cgroup);
 	return &h_cgroup->css;
+
+fail_alloc_nodeinfo:
+	hugetlb_cgroup_free(h_cgroup);
+	return ERR_PTR(-ENOMEM);
 }

 static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
 {
-	struct hugetlb_cgroup *h_cgroup;
-
-	h_cgroup = hugetlb_cgroup_from_css(css);
-	kfree(h_cgroup);
+	hugetlb_cgroup_free(hugetlb_cgroup_from_css(css));
 }

 /*
@@ -292,7 +321,9 @@  static void __hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 		return;

 	__set_hugetlb_cgroup(page, h_cg, rsvd);
-	return;
+	if (!rsvd)
+		atomic_long_add(nr_pages,
+				&h_cg->nodeinfo[page_to_nid(page)]->usage[idx]);
 }

 void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
@@ -331,7 +362,9 @@  static void __hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,

 	if (rsvd)
 		css_put(&h_cg->css);
-
+	else
+		atomic_long_sub(nr_pages,
+				&h_cg->nodeinfo[page_to_nid(page)]->usage[idx]);
 	return;
 }

@@ -421,6 +454,61 @@  enum {
 	RES_RSVD_FAILCNT,
 };

+static int hugetlb_cgroup_read_numa_stat(struct seq_file *seq, void *dummy)
+{
+	int nid;
+	struct cftype *cft = seq_cft(seq);
+	int idx = MEMFILE_IDX(cft->private);
+	bool legacy = MEMFILE_ATTR(cft->private);
+	struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(seq_css(seq));
+	struct cgroup_subsys_state *css;
+	unsigned long usage;
+
+	if (legacy) {
+		/* Add up usage across all nodes for the non-hierarchical total. */
+		usage = 0;
+		for_each_node_state(nid, N_MEMORY)
+			usage += atomic_long_read(
+				&h_cg->nodeinfo[nid]->usage[idx]);
+		seq_printf(seq, "total=%lu", usage * PAGE_SIZE);
+
+		/* Simply print the per-node usage for the non-hierarchical total. */
+		for_each_node_state(nid, N_MEMORY)
+			seq_printf(seq, " N%d=%lu", nid,
+				   atomic_long_read(
+					   &h_cg->nodeinfo[nid]->usage[idx]) *
+					   PAGE_SIZE);
+		seq_putc(seq, '\n');
+	}
+
+	/*
+	 * The hierarchical total is pretty much the value recorded by the
+	 * counter, so use that.
+	 */
+	seq_printf(seq, "%stotal=%lu", legacy ? "hierarichal_" : "",
+		   page_counter_read(&h_cg->hugepage[idx]) * PAGE_SIZE);
+
+	/*
+	 * For each node, traverse the css tree to obtain the hierarchical
+	 * node usage.
+	 */
+	for_each_node_state(nid, N_MEMORY) {
+		usage = 0;
+		rcu_read_lock();
+		css_for_each_descendant_pre(css, &h_cg->css) {
+			usage += atomic_long_read(&hugetlb_cgroup_from_css(css)
+							   ->nodeinfo[nid]
+							   ->usage[idx]);
+		}
+		rcu_read_unlock();
+		seq_printf(seq, " N%d=%lu", nid, usage * PAGE_SIZE);
+	}
+
+	seq_putc(seq, '\n');
+
+	return 0;
+}
+
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
 				   struct cftype *cft)
 {
@@ -671,8 +759,14 @@  static void __init __hugetlb_cgroup_file_dfl_init(int idx)
 				    events_local_file[idx]);
 	cft->flags = CFTYPE_NOT_ON_ROOT;

-	/* NULL terminate the last cft */
+	/* Add the numa stat file */
 	cft = &h->cgroup_files_dfl[6];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+	cft->flags = CFTYPE_NOT_ON_ROOT;
+
+	/* NULL terminate the last cft */
+	cft = &h->cgroup_files_dfl[7];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
@@ -742,8 +836,14 @@  static void __init __hugetlb_cgroup_file_legacy_init(int idx)
 	cft->write = hugetlb_cgroup_reset;
 	cft->read_u64 = hugetlb_cgroup_read_u64;

+	/* Add the numa stat file */
+	cft = &h->cgroup_files_legacy[8];
+	snprintf(cft->name, MAX_CFTYPE_NAME, "%s.numa_stat", buf);
+	cft->private = MEMFILE_PRIVATE(idx, 1);
+	cft->seq_show = hugetlb_cgroup_read_numa_stat;
+
 	/* NULL terminate the last cft */
-	cft = &h->cgroup_files_legacy[8];
+	cft = &h->cgroup_files_legacy[9];
 	memset(cft, 0, sizeof(*cft));

 	WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,