Message ID | 20210309100717.253-3-songmuchun@bytedance.com (mailing list archive)
---|---
State | New, archived
Series | Use obj_cgroup APIs to charge kmem pages
On Tue, Mar 09, 2021 at 06:07:15PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
>
> [...]
>
> @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> 			     int val)
> {
> 	struct page *head = compound_head(page); /* rmap on tail pages */
> -	struct mem_cgroup *memcg = page_memcg(head);
> +	struct mem_cgroup *memcg;
> 	pg_data_t *pgdat = page_pgdat(page);
> 	struct lruvec *lruvec;
>
> +	memcg = page_memcg_check(head);

In general, this and the next patch look good to me (aside from some small
things, commented separately).

But I wonder if it's better to have two separate versions of
__mod_lruvec_page_state() for kmem and non-kmem pages, rather than rely on
the PageMemcgKmem flag. It's a hot path, so if we can have fewer conditions
here, that would be nice.

I took a brief look (and could be wrong), but it seems like we know in
advance which version should be used.

Thanks!
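To make Roman's suggestion concrete, here is a minimal sketch of what a non-kmem-only variant could look like. This is illustrative only, not code from this series; kmem callers would be expected to go through the existing __mod_lruvec_kmem_state() instead (shown after the next message):

	/*
	 * Illustrative sketch, not from the posted series: with kmem
	 * callers routed to __mod_lruvec_kmem_state(), the page variant
	 * can stay non-kmem-only and skip the PageMemcgKmem() test on
	 * the hot path.
	 */
	void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
				     int val)
	{
		struct page *head = compound_head(page); /* rmap on tail pages */
		struct mem_cgroup *memcg = page_memcg(head); /* non-kmem pages only */
		pg_data_t *pgdat = page_pgdat(page);

		/* Untracked pages have no memcg, no lruvec. Update only the node. */
		if (!memcg) {
			__mod_node_page_state(pgdat, idx, val);
			return;
		}

		__mod_lruvec_state(mem_cgroup_lruvec(memcg, pgdat), idx, val);
	}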
On Thu, Mar 11, 2021 at 3:58 AM Roman Gushchin <guro@fb.com> wrote:
>
> [...]
>
> But I wonder if it's better to have two separate versions of
> __mod_lruvec_page_state() for kmem and non-kmem pages, rather than rely on
> the PageMemcgKmem flag. It's a hot path, so if we can have fewer conditions
> here, that would be nice.
>
> I took a brief look (and could be wrong), but it seems like we know in
> advance which version should be used.

Right. The caller should know whether the page is kmem. IIUC,
__mod_lruvec_kmem_state() is already the kmem version, so it is enough to
reuse it and make __mod_lruvec_page_state() suitable only for non-kmem
pages.
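For context, the existing kmem-side helper that Muchun proposes to reuse looked roughly like this at the time (reproduced from memory, so treat the details as approximate):

	/* The kmem-side counterpart in mm/memcontrol.c of that era. */
	void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val)
	{
		pg_data_t *pgdat = page_pgdat(virt_to_page(p));
		struct mem_cgroup *memcg;
		struct lruvec *lruvec;

		rcu_read_lock();
		memcg = mem_cgroup_from_obj(p);

		/* Untracked pages have no memcg, no lruvec. Update only the node. */
		if (!memcg) {
			__mod_node_page_state(pgdat, idx, val);
		} else {
			lruvec = mem_cgroup_lruvec(memcg, pgdat);
			__mod_lruvec_state(lruvec, idx, val);
		}
		rcu_read_unlock();
	}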
On Tue, Mar 09, 2021 at 06:07:15PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
>
> [...]
>
> This is a preparation for reparenting the kmem pages.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>

I'm pretty excited about the direction this series is taking us. The
direct/raw pinning of memcgs from objects and allocations of various
lifetimes has been causing chronic problems with dying cgroups piling up,
consuming memory, and gumming up the works in everything that needs to
iterate the cgroup tree (page reclaim comes to mind).

The more allocation types we can convert to objcg, the better.

This patch in particular looks mostly good to me too. Some comments inline:

> +/* Return true for charged page, otherwise false. */
> +static inline bool page_memcg_charged(struct page *page)
> +{
> +	unsigned long memcg_data = page->memcg_data;
> +
> +	VM_BUG_ON_PAGE(PageSlab(page), page);
> +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> +	return !!memcg_data;
> +}

This is mostly used right before a page_memcg_check(), which makes it
somewhat redundant except for the VM_BUG_ON_PAGE() for slab pages.

But it's also a bit of a confusing name: slab pages are charged too, but
this function would crash if you called it on one.

In light of this, and in light of what I wrote above about hopefully
converting more and more allocations from raw memcg pins to reparentable
objcg, it would be better to have

  page_memcg()          for 1:1 page-memcg mappings, i.e. LRU & kmem
  page_objcg()          for 1:n page-memcg mappings, i.e. slab pages
  page_memcg_check()    for the very rare ambiguous cases

  drop page_memcg_rcu() since page_memcg() is now rcu-safe

If we wanted to optimize, we could identify places that could do a
page_memcg_raw() that does page->memcg_data & ~MEMCG_DATA_FLAGS_MASK -
without READ_ONCE and without the kmem branch. However, I think the stat
functions are probably the hottest path when it comes to that, and they
now already include the kmem branch*.

* Roman mentioned splitting up the stats interface to optimize that, but I
  think we should be careful optimizing prematurely here. It's a bit of a
  maintainability concern, and it would also get in the way of easily
  converting more allocation types to objcg.

> @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> 			     int val)
> {
> 	struct page *head = compound_head(page); /* rmap on tail pages */
> -	struct mem_cgroup *memcg = page_memcg(head);
> +	struct mem_cgroup *memcg;
> 	pg_data_t *pgdat = page_pgdat(page);
> 	struct lruvec *lruvec;
>
> +	memcg = page_memcg_check(head);

With the proposed variants above, this should be page_memcg() and actually
warn/crash when we pass a slab page to this function.

> @@ -3166,12 +3167,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  */
> void __memcg_kmem_uncharge_page(struct page *page, int order)
> {
> -	struct mem_cgroup *memcg = page_memcg(page);
> +	struct mem_cgroup *memcg;
> 	unsigned int nr_pages = 1 << order;
>
> -	if (!memcg)
> +	if (!page_memcg_charged(page))
> 		return;
>
> +	memcg = page_memcg_check(page);

This would remain unchanged:

	memcg = page_memcg(page);
	if (!memcg)
		return;

> @@ -6827,24 +6829,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> {
> 	unsigned long nr_pages;
> +	struct mem_cgroup *memcg;
>
> 	VM_BUG_ON_PAGE(PageLRU(page), page);
>
> -	if (!page_memcg(page))
> +	if (!page_memcg_charged(page))
> 		return;
>
> 	/*
> 	 * Nobody should be changing or seriously looking at
> -	 * page_memcg(page) at this point, we have fully
> -	 * exclusive access to the page.
> +	 * page memcg at this point, we have fully exclusive
> +	 * access to the page.
> 	 */
> -
> -	if (ug->memcg != page_memcg(page)) {
> +	memcg = page_memcg_check(page);

Same situation here:

	memcg = page_memcg(page);
	if (!memcg)
		return;

> @@ -6877,7 +6880,7 @@ void mem_cgroup_uncharge(struct page *page)
> 		return;
>
> 	/* Don't touch page->lru of any random page, pre-check: */
> -	if (!page_memcg(page))
> +	if (!page_memcg_charged(page))
> 		return;

Same:

	if (!page_memcg(page))
		return;

> @@ -1124,7 +1124,7 @@ static inline bool page_expected_state(struct page *page,
> 	if (unlikely((unsigned long)page->mapping |
> 			page_ref_count(page) |
> #ifdef CONFIG_MEMCG
> -			(unsigned long)page_memcg(page) |
> +			page_memcg_charged(page) |
> #endif
> 			(page->flags & check_flags)))
> 		return false;

Actually, I think we might want to just check the raw

	page->memcg_data

here, as neither lru, nor kmem, nor slab page should have anything left in
there by the time the page is freed.

> @@ -1149,7 +1149,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
> 		bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
> 	}
> #ifdef CONFIG_MEMCG
> -	if (unlikely(page_memcg(page)))
> +	if (unlikely(page_memcg_charged(page)))
> 		bad_reason = "page still charged to cgroup";

Same here.
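Concretely, the last two suggestions would reduce the mm/page_alloc.c checks to the raw field, along these lines (a sketch of the suggestion as stated, not code posted in the series):

	/* page_expected_state(): the raw field must be 0 at free time. */
	if (unlikely((unsigned long)page->mapping |
			page_ref_count(page) |
	#ifdef CONFIG_MEMCG
			page->memcg_data |
	#endif
			(page->flags & check_flags)))
		return false;

	/* page_bad_reason(): same raw check. */
	#ifdef CONFIG_MEMCG
	if (unlikely(page->memcg_data))
		bad_reason = "page still charged to cgroup";
	#endif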
On Tue, Mar 9, 2021 at 2:09 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
>
> Finally, page->memcg_data can have 3 different meanings.

Replace 'can' with 'will'.

Other than that, I think Roman and Johannes have already given very good
feedback.
On Fri, Mar 12, 2021 at 11:22 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> > Finally, page->memcg_data can have 3 different meanings.
>
> Replace 'can' with 'will'.

Will do. Thanks.

> Other than that, I think Roman and Johannes have already given very good
> feedback.
On Thu, Mar 11, 2021 at 9:12 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> This patch in particular looks mostly good to me too. Some comments
> inline:

Hi Johannes,

Thanks a lot for your suggestions, but I have some questions, as below.

> > +/* Return true for charged page, otherwise false. */
> > +static inline bool page_memcg_charged(struct page *page)
> > +{
> > +	unsigned long memcg_data = page->memcg_data;
> > +
> > +	VM_BUG_ON_PAGE(PageSlab(page), page);
> > +	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> > +
> > +	return !!memcg_data;
> > +}
>
> This is mostly used right before a page_memcg_check(), which makes it
> somewhat redundant except for the VM_BUG_ON_PAGE() for slab pages.

Should I rename page_memcg_charged() to page_memcg_raw(), and use it to
check whether the page is charged?

	static inline bool page_memcg_charged(struct page *page)
	{
		return page->memcg_data;
	}

> But it's also a bit of a confusing name: slab pages are charged too, but
> this function would crash if you called it on one.
>
> In light of this, and in light of what I wrote above about hopefully
> converting more and more allocations from raw memcg pins to reparentable
> objcg, it would be better to have
>
>   page_memcg()          for 1:1 page-memcg mappings, i.e. LRU & kmem

Sorry, I do not get the point. Because in the next patch the kmem page
will use objcg to charge memory, page_memcg() should not be suitable for
the kmem pages, so I added a VM_BUG_ON to page_memcg() to catch invalid
usage. That is why I changed some page_memcg() calls to page_memcg_check()
in this patch, but you suggest using page_memcg(). I am very confused. Are
you worried about the extra overhead brought by calling page_memcg_rcu()?
In the next patch, I will remove the page_memcg_check() calls and use the
objcg APIs.

>   page_objcg()          for 1:n page-memcg mappings, i.e. slab pages
>   page_memcg_check()    for the very rare ambiguous cases
>
>   drop page_memcg_rcu() since page_memcg() is now rcu-safe
                               ^^^ page_memcg_check()

Here you mean page_memcg_check(), right? I see a READ_ONCE() in
page_memcg_check(), but page_memcg() doesn't have one.
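For reference, the page_memcg_check() that Muchun is contrasting with page_memcg() here does use READ_ONCE(); it read roughly like this in the tree at the time (reproduced from memory, so treat it as approximate):

	static inline struct mem_cgroup *page_memcg_check(struct page *page)
	{
		/*
		 * Because page->memcg_data might be changed asynchronously
		 * for slab pages, READ_ONCE() should be used here.
		 */
		unsigned long memcg_data = READ_ONCE(page->memcg_data);

		if (memcg_data & MEMCG_DATA_OBJCGS)
			return NULL;

		return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
	}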
Hello Muchun,

On Fri, Mar 12, 2021 at 03:14:07PM +0800, Muchun Song wrote:
> Should I rename page_memcg_charged() to page_memcg_raw(), and use it to
> check whether the page is charged?
>
> 	static inline bool page_memcg_charged(struct page *page)
> 	{
> 		return page->memcg_data;
> 	}

You can just directly access page->memcg_data in places where you'd use
this helper. I think it's only the two places in mm/page_alloc.c, and they
already have CONFIG_MEMCG in place, so raw access works.

> Sorry, I do not get the point. Because in the next patch the kmem page
> will use objcg to charge memory, page_memcg() should not be suitable for
> the kmem pages, so I added a VM_BUG_ON to page_memcg() to catch invalid
> usage. That is why I changed some page_memcg() calls to
> page_memcg_check() in this patch, but you suggest using page_memcg().

It would be better if page_memcg() worked on LRU and kmem pages. I'm
proposing to change its implementation.

The reason is that page_memcg_check() allows everything and does no sanity
checking. You need page_memcg_charged() for the sanity checks that it's
LRU or kmem, but that's a bit difficult to understand, and it's possible
people will add more callsites to page_memcg_check() without the
page_memcg_charged() checks. It makes the code less safe.

We should discourage page_memcg_check() and make page_memcg() more useful
instead.

> I am very confused. Are you worried about the extra overhead brought by
> calling page_memcg_rcu()? In the next patch, I will remove the
> page_memcg_check() calls and use the objcg APIs.

I'm just worried about the usability of the interface. It should be easy
to use, and make it obvious if there is a user bug.

For example, in your next patch, mod_lruvec_page_state does this:

	if (PageMemcgKmem(head)) {
		rcu_read_lock();
		memcg = obj_cgroup_memcg(page_objcg(page));
	} else {
		memcg = page_memcg(head);
		/*
		 * Untracked pages have no memcg, no lruvec. Update only
		 * the node.
		 */
		if (!memcg) {
			__mod_node_page_state(pgdat, idx, val);
			return;
		}
	}

	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_lruvec_state(lruvec, idx, val);

	if (PageMemcgKmem(head))
		rcu_read_unlock();

I'm proposing to implement page_memcg() in a way where you can do this:

	rcu_read_lock();

	memcg = page_memcg(page);
	if (!memcg) {
		rcu_read_unlock();
		__mod_node_page_state();
		return;
	}

	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_lruvec_state(lruvec, idx, val);

	rcu_read_unlock();

[ page_memcg() is:

	if (PageMemcgKmem(page))
		return obj_cgroup_memcg(__page_objcg(page));
	else
		return __page_memcg(page);

  and __page_objcg() and __page_memcg() do the raw page->memcg_data
  translation and the VM_BUG_ON_PAGE() checks for MEMCG_DATA_* ]

This is a lot simpler and less error prone.

It does take rcu_read_lock() for LRU pages too, which strictly it doesn't
need to right now. But it's cheap enough (and actually saves a branch).

Longer term we most likely need it there anyway. The issue you are
describing in the cover letter - allocations pinning memcgs for a long
time - it exists at a larger scale and is causing recurring problems in
the real world: page cache doesn't get reclaimed for a long time, or is
used by the second, third, fourth, ... instance of the same job that was
restarted into a new cgroup every time. Unreclaimable dying cgroups pile
up, waste memory, and make page reclaim very inefficient.

We likely have to convert LRU pages and most other raw memcg pins to the
objcg direction to fix this problem, and then the page->memcg lookup will
always require the rcu read lock anyway.

Finally, a universal page_memcg() should also make uncharge_page()
simpler. Instead of obj_cgroup_memcg_get(), you could use the new
page_memcg() to implement a universal get_mem_cgroup_from_page():

	rcu_read_lock();
retry:
	memcg = page_memcg(page);
	if (unlikely(!css_tryget(&memcg->css)))
		goto retry;
	rcu_read_unlock();

	return memcg;

and then uncharge_page() becomes something like this:

	/* Look up page's memcg & prepare the batch */
	memcg = get_mem_cgroup_from_page(page);
	if (!memcg)
		return;

	if (ug->memcg != memcg) {
		...
		css_get(&memcg->css); /* batch ref, put in uncharge_batch() */
	}
	mem_cgroup_put(memcg);

	/* Add page to batch */
	nr_pages = compound_nr(page);
	...

	/* Clear the page->memcg link */
	if (PageMemcgKmem(page))
		obj_cgroup_put(__page_objcg(page));
	else
		css_put(__page_memcg(&memcg->css));
	page->memcg_data = 0;

Does that sound reasonable?

PS: We have several page_memcg() callsites that could use the raw
__page_memcg() directly for now. Is it worth switching them and saving the
branch? I think probably not, because these paths aren't hot, and as per
above, we should switch them to objcg sooner or later anyway.
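For readers who want the bracketed helpers spelled out, here is a hedged sketch of __page_memcg() and __page_objcg() consistent with Johannes's description; the exact set of VM_BUG_ON_PAGE() assertions is an assumption, not posted code:

	/* Sketch under the proposal above; assertions are assumptions. */
	static inline struct mem_cgroup *__page_memcg(struct page *page)
	{
		unsigned long memcg_data = page->memcg_data;

		VM_BUG_ON_PAGE(PageSlab(page), page);
		VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
		VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);

		return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
	}

	static inline struct obj_cgroup *__page_objcg(struct page *page)
	{
		unsigned long memcg_data = page->memcg_data;

		VM_BUG_ON_PAGE(PageSlab(page), page);
		VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
		VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);

		return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
	}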
Hi Johannes,

On Fri, Mar 12, 2021 at 11:23 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> Longer term we most likely need it there anyway. The issue you are
> describing in the cover letter - allocations pinning memcgs for a long
> time - it exists at a larger scale and is causing recurring problems in
> the real world: page cache doesn't get reclaimed for a long time, or is
> used by the second, third, fourth, ... instance of the same job that was
> restarted into a new cgroup every time. Unreclaimable dying cgroups pile
> up, waste memory, and make page reclaim very inefficient.

For the scenario described above, do we really want to reparent the page
cache pages? Shouldn't we recharge the pages to the second, third, fourth,
and so on, memcgs? My concern is that we will see a big chunk of page
cache pages charged to root that will only get reclaimed under global
pressure.
On Fri, Mar 12, 2021 at 02:42:45PM -0800, Shakeel Butt wrote:
> For the scenario described above, do we really want to reparent the page
> cache pages? Shouldn't we recharge the pages to the second, third,
> fourth, and so on, memcgs? My concern is that we will see a big chunk of
> page cache pages charged to root that will only get reclaimed under
> global pressure.

Sorry, I'm proposing to reparent to the ancestor, not root. It's an
optimization, not a change in user-visible behavior.

As far as the user can tell, the pages already belong to the parent after
deletion: they'll show up in the parent's stats, naturally, and they will
get reclaimed as part of the parent being reclaimed.

The dead cgroup doesn't even have its own limit anymore after .css_reset()
has run. And we already physically reparent slab objects in
memcg_reparent_objcgs() and memcg_drain_all_list_lrus().

I'm just saying we should do the same thing for LRU pages.
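In code terms, the reason reparenting is cheap with objcg is that an allocation pins only the obj_cgroup, whose memcg pointer is rewritten when the cgroup dies; no individual page or object is visited. A simplified sketch based on the existing slab-objcg code, with fields beyond the first two and all locking elided:

	struct obj_cgroup {
		struct percpu_ref	refcnt;
		struct mem_cgroup	*memcg;	/* redirected on reparenting */
		/* ... */
	};

	/* Resolving an objcg to its current memcg, as in the kernel of that time. */
	static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
	{
		return READ_ONCE(objcg->memcg);
	}

	/*
	 * On cgroup death, memcg_reparent_objcgs() points every objcg of
	 * the dying memcg at its parent; subsequent obj_cgroup_memcg()
	 * lookups then resolve to the parent automatically.
	 */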
On Fri, Mar 12, 2021 at 3:07 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> The dead cgroup doesn't even have its own limit anymore after
> .css_reset() has run. And we already physically reparent slab objects in
> memcg_reparent_objcgs() and memcg_drain_all_list_lrus().
>
> I'm just saying we should do the same thing for LRU pages.

I understand the proposal, and I agree it makes total sense when a job is
recycling sub-jobs/sub-containers. I was talking about the (recycling of
the) top-level cgroups, though for that to be an issue, I suppose the file
system has to be shared between the jobs on the system.

I was wondering: if a page cache page reaches the root memcg after
multiple reparentings, should the next access cause the page to be charged
to the accessing cgroup?
On Sat, Mar 13, 2021 at 3:23 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> You can just directly access page->memcg_data in places where you'd use
> this helper. I think it's only the two places in mm/page_alloc.c, and
> they already have CONFIG_MEMCG in place, so raw access works.

OK.

> [...]
>
> [ page_memcg() is:
>
> 	if (PageMemcgKmem(page))
> 		return obj_cgroup_memcg(__page_objcg(page));
> 	else
> 		return __page_memcg(page);
>
>   and __page_objcg() and __page_memcg() do the raw page->memcg_data
>   translation and the VM_BUG_ON_PAGE() checks for MEMCG_DATA_* ]

Thanks for your suggestions. I will rework the code like this.

> We likely have to convert LRU pages and most other raw memcg pins to the
> objcg direction to fix this problem, and then the page->memcg lookup
> will always require the rcu read lock anyway.

Yeah, I agree with you. I am already working on this (it is on my todo
list).

> Does that sound reasonable?

Makes sense to me.

> PS: We have several page_memcg() callsites that could use the raw
> __page_memcg() directly for now. Is it worth switching them and saving
> the branch? I think probably not, because these paths aren't hot, and as
> per above, we should switch them to objcg sooner or later anyway.

Got it. Thank you very much for the explanation.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..83cbcdcfcc92 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -358,14 +358,26 @@ enum page_memcg_data_flags {
 
 #define MEMCG_DATA_FLAGS_MASK	(__NR_MEMCG_DATA_FLAGS - 1)
 
+/* Return true for charged page, otherwise false. */
+static inline bool page_memcg_charged(struct page *page)
+{
+	unsigned long memcg_data = page->memcg_data;
+
+	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
+
+	return !!memcg_data;
+}
+
 /*
- * page_memcg - get the memory cgroup associated with a page
+ * page_memcg - get the memory cgroup associated with a non-kmem page
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the memory cgroup associated with the page,
  * or NULL. This function assumes that the page is known to have a
  * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of pages, e.g. slab pages, kmem pages or ex-slab
+ * pages.
  *
  * Any of the following ensures page and memcg binding stability:
  * - the page lock
@@ -378,27 +390,31 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
 	unsigned long memcg_data = page->memcg_data;
 
 	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
 	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
 
 	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
 /*
- * page_memcg_rcu - locklessly get the memory cgroup associated with a page
+ * page_memcg_rcu - locklessly get the memory cgroup associated with a non-kmem page
  * @page: a pointer to the page struct
  *
  * Returns a pointer to the memory cgroup associated with the page,
  * or NULL. This function assumes that the page is known to have a
  * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages.
+ * against some type of pages, e.g. slab pages, kmem pages or ex-slab
+ * pages.
  */
 static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
 {
+	unsigned long memcg_data = READ_ONCE(page->memcg_data);
+
 	VM_BUG_ON_PAGE(PageSlab(page), page);
+	VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
 	WARN_ON_ONCE(!rcu_read_lock_held());
 
-	return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) &
-				     ~MEMCG_DATA_FLAGS_MASK);
+	return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
 }
 
 /*
@@ -1072,6 +1088,11 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 
 struct mem_cgroup;
 
+static inline bool page_memcg_charged(struct page *page)
+{
+	return false;
+}
+
 static inline struct mem_cgroup *page_memcg(struct page *page)
 {
 	return NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc22da9805fb..e1dc73ceb98a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
 			     int val)
 {
 	struct page *head = compound_head(page); /* rmap on tail pages */
-	struct mem_cgroup *memcg = page_memcg(head);
+	struct mem_cgroup *memcg;
 	pg_data_t *pgdat = page_pgdat(page);
 	struct lruvec *lruvec;
 
+	memcg = page_memcg_check(head);
 	/* Untracked pages have no memcg, no lruvec. Update only the node */
 	if (!memcg) {
 		__mod_node_page_state(pgdat, idx, val);
@@ -3166,12 +3167,13 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
  */
 void __memcg_kmem_uncharge_page(struct page *page, int order)
 {
-	struct mem_cgroup *memcg = page_memcg(page);
+	struct mem_cgroup *memcg;
 	unsigned int nr_pages = 1 << order;
 
-	if (!memcg)
+	if (!page_memcg_charged(page))
 		return;
 
+	memcg = page_memcg_check(page);
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	__memcg_kmem_uncharge(memcg, nr_pages);
 	page->memcg_data = 0;
@@ -6827,24 +6829,25 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 static void uncharge_page(struct page *page, struct uncharge_gather *ug)
 {
 	unsigned long nr_pages;
+	struct mem_cgroup *memcg;
 
 	VM_BUG_ON_PAGE(PageLRU(page), page);
 
-	if (!page_memcg(page))
+	if (!page_memcg_charged(page))
 		return;
 
 	/*
 	 * Nobody should be changing or seriously looking at
-	 * page_memcg(page) at this point, we have fully
-	 * exclusive access to the page.
+	 * page memcg at this point, we have fully exclusive
+	 * access to the page.
 	 */
-
-	if (ug->memcg != page_memcg(page)) {
+	memcg = page_memcg_check(page);
+	if (ug->memcg != memcg) {
 		if (ug->memcg) {
 			uncharge_batch(ug);
 			uncharge_gather_clear(ug);
 		}
-		ug->memcg = page_memcg(page);
+		ug->memcg = memcg;
 
 		/* pairs with css_put in uncharge_batch */
 		css_get(&ug->memcg->css);
@@ -6877,7 +6880,7 @@ void mem_cgroup_uncharge(struct page *page)
 		return;
 
 	/* Don't touch page->lru of any random page, pre-check: */
-	if (!page_memcg(page))
+	if (!page_memcg_charged(page))
 		return;
 
 	uncharge_gather_clear(&ug);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f10966e3b4a5..bcb58ae15e24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1124,7 +1124,7 @@ static inline bool page_expected_state(struct page *page,
 	if (unlikely((unsigned long)page->mapping |
 			page_ref_count(page) |
 #ifdef CONFIG_MEMCG
-			(unsigned long)page_memcg(page) |
+			page_memcg_charged(page) |
 #endif
 			(page->flags & check_flags)))
 		return false;
@@ -1149,7 +1149,7 @@ static const char *page_bad_reason(struct page *page, unsigned long flags)
 		bad_reason = "PAGE_FLAGS_CHECK_AT_FREE flag(s) set";
 	}
 #ifdef CONFIG_MEMCG
-	if (unlikely(page_memcg(page)))
+	if (unlikely(page_memcg_charged(page)))
 		bad_reason = "page still charged to cgroup";
 #endif
 	return bad_reason;
We want to reuse the obj_cgroup APIs to charge the kmem pages. If we do
that, we should store an object cgroup pointer in page->memcg_data for the
kmem pages.

Finally, page->memcg_data can have 3 different meanings.

  1) For the slab pages, page->memcg_data points to an object cgroups
     vector.

  2) For the kmem pages (excluding the slab pages), page->memcg_data
     points to an object cgroup.

  3) For the user pages (e.g. the LRU pages), page->memcg_data points to
     a memory cgroup.

Currently we always get the memory cgroup associated with a page via
page_memcg() or page_memcg_rcu(). page_memcg_check() is special: it has to
be used in cases when it's not known if a page has an associated memory
cgroup pointer or an object cgroups vector. Because the page->memcg_data
of the kmem page will not point to a memory cgroup after a later patch,
page_memcg() and page_memcg_rcu() are not applicable to the kmem pages. In
this patch, make page_memcg() and page_memcg_rcu() no longer apply to the
kmem pages. We do not change the behavior of page_memcg_check(); it
remains applicable to the kmem pages.

In the end, there are 3 helpers to get the memcg associated with a page.
Usage is as follows.

  1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
     pages).

     - page_memcg()
     - page_memcg_rcu()

  2) Get the memory cgroup associated with a page. It has to be used in
     cases when it's not known if a page has an associated memory cgroup
     pointer or an object cgroups vector. Returns NULL for slab pages or
     uncharged pages. Otherwise, returns the memory cgroup for charged
     pages (e.g. the kmem pages, the LRU pages).

     - page_memcg_check()

In some places, we use page_memcg() to check whether the page is charged.
Now introduce the page_memcg_charged() helper to do that.

This is a preparation for reparenting the kmem pages.

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
 include/linux/memcontrol.h | 33 +++++++++++++++++++++++++++------
 mm/memcontrol.c            | 23 +++++++++++++----------
 mm/page_alloc.c            |  4 ++--
 3 files changed, 42 insertions(+), 18 deletions(-)
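To make the three memcg_data states above concrete, here is a small illustrative helper (not part of the patch) that dispatches on the MEMCG_DATA_* flag bits the patch already relies on; note that for kmem pages the pointed-to object changes from a memcg to an objcg only later in this series:

	/*
	 * Illustration only, not from the patch: how the low bits of
	 * page->memcg_data distinguish the three cases described above.
	 */
	static const char *memcg_data_kind(struct page *page)
	{
		unsigned long memcg_data = READ_ONCE(page->memcg_data);

		if (!memcg_data)
			return "uncharged page";
		if (memcg_data & MEMCG_DATA_OBJCGS)
			return "slab page: points to an obj_cgroup vector";
		if (memcg_data & MEMCG_DATA_KMEM)
			return "kmem page: memcg now, obj_cgroup later in this series";
		return "user (e.g. LRU) page: points to a mem_cgroup";
	}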