[RFC] mm, kvm: account kvm_vcpu_mmap to kmemcg

Message ID 20190329012836.47013-1-shakeelb@google.com (mailing list archive)
State New, archived
Series [RFC] mm, kvm: account kvm_vcpu_mmap to kmemcg

Commit Message

Shakeel Butt March 29, 2019, 1:28 a.m. UTC
A VCPU of a VM can allocate up to three pages which can be mmap'ed by the
user space application. At the moment this memory is not charged. On a
large machine running a large number of VMs (or a small number of VMs with
a large number of VCPUs), this unaccounted memory can be very significant.
So, this memory should be charged to a kmemcg. However, that is not
possible because these pages are mmapped to userspace and PageKmemcg()
was designed with the assumption that such pages will never be mmapped
to userspace.

One way to solve this problem is to introduce an additional memcg
charging API similar to mem_cgroup_[un]charge_skmem(). However, the skmem
charging API is used from a small, contained set of shared call sites and
no new users are expected, whereas the number of places that allocate
pages which can be mmapped and should be charged to kmemcg can and will
increase. So, requiring such an API at every one of those sites would
increase the maintenance burden. The simplest solution is to remove the
assumption that PageKmemcg() pages are never mmapped to user space.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 arch/s390/kvm/kvm-s390.c       |  2 +-
 arch/x86/kvm/x86.c             |  2 +-
 include/linux/page-flags.h     | 26 ++++++++++++++++++--------
 include/trace/events/mmflags.h |  9 ++++++++-
 virt/kvm/coalesced_mmio.c      |  2 +-
 virt/kvm/kvm_main.c            |  2 +-
 6 files changed, 30 insertions(+), 13 deletions(-)
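
For comparison, a rough sketch (not part of this patch) of the alternative
rejected above: an explicit charging API at each call site. The function
below is hypothetical and reuses mem_cgroup_charge_skmem() purely to
illustrate the call pattern that a new, similar API would impose on every
allocation site.

	/*
	 * Hypothetical call site under the explicit-API approach: every
	 * place that allocates a user-mappable page would need a matching
	 * charge/uncharge pair, similar to mem_cgroup_[un]charge_skmem().
	 */
	static struct page *alloc_vcpu_page(struct mem_cgroup *memcg)
	{
		struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

		if (page && !mem_cgroup_charge_skmem(memcg, 1)) {
			__free_page(page);
			return NULL;
		}
		return page;
	}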

Comments

Matthew Wilcox March 29, 2019, 2:35 a.m. UTC | #1
On Thu, Mar 28, 2019 at 06:28:36PM -0700, Shakeel Butt wrote:
> A VCPU of a VM can allocate upto three pages which can be mmap'ed by the
> user space application. At the moment this memory is not charged. On a
> large machine running large number of VMs (or small number of VMs having
> large number of VCPUs), this unaccounted memory can be very significant.
> So, this memory should be charged to a kmemcg. However that is not
> possible as these pages are mmapped to the userspace and PageKmemcg()
> was designed with the assumption that such pages will never be mmapped
> to the userspace.
> 
> One way to solve this problem is by introducing an additional memcg
> charging API similar to mem_cgroup_[un]charge_skmem(). However skmem
> charging API usage is contained and shared and no new users are
> expected but the pages which can be mmapped and should be charged to
> kmemcg can and will increase. So, requiring the usage for such API will
> increase the maintenance burden. The simplest solution is to remove the
> assumption of no mmapping PageKmemcg() pages to user space.

The usual response under these circumstances is "No, you can't have a
page flag bit".

I don't understand why we need a PageKmemcg anyway.  We already
have an entire pointer in struct page; can we not just check whether
page->mem_cgroup is NULL or not?
Shakeel Butt March 29, 2019, 3:59 a.m. UTC | #2
On Thu, Mar 28, 2019 at 7:36 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 28, 2019 at 06:28:36PM -0700, Shakeel Butt wrote:
> > A VCPU of a VM can allocate upto three pages which can be mmap'ed by the
> > user space application. At the moment this memory is not charged. On a
> > large machine running large number of VMs (or small number of VMs having
> > large number of VCPUs), this unaccounted memory can be very significant.
> > So, this memory should be charged to a kmemcg. However that is not
> > possible as these pages are mmapped to the userspace and PageKmemcg()
> > was designed with the assumption that such pages will never be mmapped
> > to the userspace.
> >
> > One way to solve this problem is by introducing an additional memcg
> > charging API similar to mem_cgroup_[un]charge_skmem(). However skmem
> > charging API usage is contained and shared and no new users are
> > expected but the pages which can be mmapped and should be charged to
> > kmemcg can and will increase. So, requiring the usage for such API will
> > increase the maintenance burden. The simplest solution is to remove the
> > assumption of no mmapping PageKmemcg() pages to user space.
>
> The usual response under these circumstances is "No, you can't have a
> page flag bit".
>

I would say that for systems with CONFIG_MEMCG_KMEM enabled, a page flag
bit is not that expensive.

> I don't understand why we need a PageKmemcg anyway.  We already
> have an entire pointer in struct page; can we not just check whether
> page->mem_cgroup is NULL or not?

PageKmemcg is for kmem only, while page->mem_cgroup is used for anon, file
and kmem memory. So, page->mem_cgroup can not be used for a NULL check
unless we unify them. Not sure how complicated that would be.
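
For illustration, a minimal sketch (the helper name is hypothetical) of
why a plain NULL test is not enough:

	/*
	 * True for any charged page (anon, file or kmem), since all of
	 * those set page->mem_cgroup. It therefore cannot replace
	 * PageKmemcg(), which has to identify kmem pages specifically.
	 */
	static inline bool page_is_charged(struct page *page)
	{
		return page->mem_cgroup != NULL;
	}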

Shakeel
Michal Hocko March 29, 2019, 7:52 a.m. UTC | #3
On Thu 28-03-19 18:28:36, Shakeel Butt wrote:
> A VCPU of a VM can allocate upto three pages which can be mmap'ed by the
> user space application. At the moment this memory is not charged. On a
> large machine running large number of VMs (or small number of VMs having
> large number of VCPUs), this unaccounted memory can be very significant.

Is this really the case? How many machines are we talking about? Say I
have a virtual machine with 1K CPUs; that results in 12MB (1K VCPUs x 3
pages x 4KB per page). Is this significant compared to the overall size
of the virtual machine, enough to even care?

> So, this memory should be charged to a kmemcg. However that is not
> possible as these pages are mmapped to the userspace and PageKmemcg()
> was designed with the assumption that such pages will never be mmapped
> to the userspace.
>
> One way to solve this problem is by introducing an additional memcg
> charging API similar to mem_cgroup_[un]charge_skmem(). However skmem
> charging API usage is contained and shared and no new users are
> expected but the pages which can be mmapped and should be charged to
> kmemcg can and will increase. So, requiring the usage for such API will
> increase the maintenance burden. The simplest solution is to remove the
> assumption of no mmapping PageKmemcg() pages to user space.

IIRC the only purpose of PageKmemcg is to keep accounting in the legacy
memcg right. Spending a page flag for that is just a no-go. If PageKmemcg
cannot reuse mapping then we have to find a better place for it (e.g. the
bottom bit in the page->memcg pointer) or rethink the whole PageKmemcg
approach.
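
For illustration, a minimal sketch of the bottom-bit idea (names are
hypothetical; it relies on struct mem_cgroup being word-aligned, so bit 0
of the pointer is otherwise always clear):

	#define PAGE_MEMCG_KMEM	0x1UL

	/* Encode "charged as kmem" in the low bit of page->mem_cgroup
	 * instead of consuming a separate page flag. */
	static inline struct mem_cgroup *page_memcg_ptr(struct page *page)
	{
		return (struct mem_cgroup *)
			((unsigned long)page->mem_cgroup & ~PAGE_MEMCG_KMEM);
	}

	static inline bool page_memcg_kmem(struct page *page)
	{
		return (unsigned long)page->mem_cgroup & PAGE_MEMCG_KMEM;
	}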
 
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  arch/s390/kvm/kvm-s390.c       |  2 +-
>  arch/x86/kvm/x86.c             |  2 +-
>  include/linux/page-flags.h     | 26 ++++++++++++++++++--------
>  include/trace/events/mmflags.h |  9 ++++++++-
>  virt/kvm/coalesced_mmio.c      |  2 +-
>  virt/kvm/kvm_main.c            |  2 +-
>  6 files changed, 30 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
> index 4638303ba6a8..1a9e337ed5da 100644
> --- a/arch/s390/kvm/kvm-s390.c
> +++ b/arch/s390/kvm/kvm-s390.c
> @@ -2953,7 +2953,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
>  		goto out;
>  
>  	BUILD_BUG_ON(sizeof(struct sie_page) != 4096);
> -	sie_page = (struct sie_page *) get_zeroed_page(GFP_KERNEL);
> +	sie_page = (struct sie_page *) get_zeroed_page(GFP_KERNEL_ACCOUNT);
>  	if (!sie_page)
>  		goto out_free_cpu;
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 65e4559eef2f..05c0c7eaa5c6 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9019,7 +9019,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  	else
>  		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
>  
> -	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!page) {
>  		r = -ENOMEM;
>  		goto fail;
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 9f8712a4b1a5..b47a6a327d6a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -78,6 +78,10 @@
>   * PG_hwpoison indicates that a page got corrupted in hardware and contains
>   * data with incorrect ECC bits that triggered a machine check. Accessing is
>   * not safe since it may cause another machine check. Don't touch!
> + *
> + * PG_kmemcg indicates that a kmem page is charged to a memcg. If kmemcg is
> + * enabled, the page allocator will set PageKmemcg() on  pages allocated with
> + * __GFP_ACCOUNT. It gets cleared on page free.
>   */
>  
>  /*
> @@ -130,6 +134,9 @@ enum pageflags {
>  #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
>  	PG_young,
>  	PG_idle,
> +#endif
> +#ifdef CONFIG_MEMCG_KMEM
> +	PG_kmemcg,
>  #endif
>  	__NR_PAGEFLAGS,
>  
> @@ -289,6 +296,9 @@ static inline int Page##uname(const struct page *page) { return 0; }
>  #define SETPAGEFLAG_NOOP(uname)						\
>  static inline void SetPage##uname(struct page *page) {  }
>  
> +#define __SETPAGEFLAG_NOOP(uname)					\
> +static inline void __SetPage##uname(struct page *page) {  }
> +
>  #define CLEARPAGEFLAG_NOOP(uname)					\
>  static inline void ClearPage##uname(struct page *page) {  }
>  
> @@ -427,6 +437,13 @@ TESTCLEARFLAG(Young, young, PF_ANY)
>  PAGEFLAG(Idle, idle, PF_ANY)
>  #endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +__PAGEFLAG(Kmemcg, kmemcg, PF_NO_TAIL)
> +#else
> +TESTPAGEFLAG_FALSE(kmemcg)
> +__SETPAGEFLAG_NOOP(kmemcg)
> +__CLEARPAGEFLAG_NOOP(kmemcg)
> +#endif
>  /*
>   * On an anonymous page mapped into a user virtual memory area,
>   * page->mapping points to its anon_vma, not to a struct address_space;
> @@ -701,8 +718,7 @@ PAGEFLAG_FALSE(DoubleMap)
>  #define PAGE_MAPCOUNT_RESERVE	-128
>  #define PG_buddy	0x00000080
>  #define PG_offline	0x00000100
> -#define PG_kmemcg	0x00000200
> -#define PG_table	0x00000400
> +#define PG_table	0x00000200
>  
>  #define PageType(page, flag)						\
>  	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -743,12 +759,6 @@ PAGE_TYPE_OPS(Buddy, buddy)
>   */
>  PAGE_TYPE_OPS(Offline, offline)
>  
> -/*
> - * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
> - * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
> - */
> -PAGE_TYPE_OPS(Kmemcg, kmemcg)
> -
>  /*
>   * Marks pages in use as page tables.
>   */
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index a1675d43777e..d93b78eac5b9 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -79,6 +79,12 @@
>  #define IF_HAVE_PG_IDLE(flag,string)
>  #endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +#define IF_HAVE_PG_KMEMCG(flag,string) ,{1UL << flag, string}
> +#else
> +#define IF_HAVE_PG_KMEMCG(flag,string)
> +#endif
> +
>  #define __def_pageflag_names						\
>  	{1UL << PG_locked,		"locked"	},		\
>  	{1UL << PG_waiters,		"waiters"	},		\
> @@ -105,7 +111,8 @@ IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
>  IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
>  IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
>  IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
> -IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
> +IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
> +IF_HAVE_PG_KMEMCG(PG_kmemcg,		"kmemcg"	)
>  
>  #define show_page_flags(flags)						\
>  	(flags) ? __print_flags(flags, "|",				\
> diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
> index 5294abb3f178..ebf1601de2a5 100644
> --- a/virt/kvm/coalesced_mmio.c
> +++ b/virt/kvm/coalesced_mmio.c
> @@ -110,7 +110,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
>  	int ret;
>  
>  	ret = -ENOMEM;
> -	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!page)
>  		goto out_err;
>  
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f25aa98a94df..de6328dff251 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -306,7 +306,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
>  	vcpu->pre_pcpu = -1;
>  	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
>  
> -	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> +	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
>  	if (!page) {
>  		r = -ENOMEM;
>  		goto fail;
> -- 
> 2.21.0.392.gf8f6787159e-goog
>
Johannes Weiner March 29, 2019, 2:20 p.m. UTC | #4
On Thu, Mar 28, 2019 at 08:59:45PM -0700, Shakeel Butt wrote:
> On Thu, Mar 28, 2019 at 7:36 PM Matthew Wilcox <willy@infradead.org> wrote:
> > I don't understand why we need a PageKmemcg anyway.  We already
> > have an entire pointer in struct page; can we not just check whether
> > page->mem_cgroup is NULL or not?
> 
> PageKmemcg is for kmem while page->mem_cgroup is used for anon, file
> and kmem memory. So, page->mem_cgroup can not be used for NULL check
> unless we unify them. Not sure how complicated would that be.

A page flag warrants research into this.

The only reason we have PageKmemcg() is because of the way we do
memory type accounting at uncharge time:

	if (!PageKmemcg(page)) {
		unsigned int nr_pages = 1;

		if (PageTransHuge(page)) {
			nr_pages <<= compound_order(page);
			ug->nr_huge += nr_pages;
		}
		if (PageAnon(page))
			ug->nr_anon += nr_pages;
		else {
			ug->nr_file += nr_pages;
			if (PageSwapBacked(page))
				ug->nr_shmem += nr_pages;
		}
		ug->pgpgout++;
	} else {
		ug->nr_kmem += 1 << compound_order(page);
		__ClearPageKmemcg(page);
	}

	[...]

	if (!mem_cgroup_is_root(ug->memcg)) {
		page_counter_uncharge(&ug->memcg->memory, nr_pages);
		if (do_memsw_account())
			page_counter_uncharge(&ug->memcg->memsw, nr_pages);
		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
			page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
		memcg_oom_recover(ug->memcg);
	}

	local_irq_save(flags);
	__mod_memcg_state(ug->memcg, MEMCG_RSS, -ug->nr_anon);
	__mod_memcg_state(ug->memcg, MEMCG_CACHE, -ug->nr_file);
	__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
	__mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);

But nothing says we have to have all these duplicate private counters,
or update them this late in the page's lifetime. The generic vmstat
counters in comparison are updated when 1) we know the page is going
away but 2) we still know the page type. We can do the same here.

We can either

a) Push up the MEMCG_RSS, MEMCG_CACHE etc. accounting sites to before
   the pages are uncharged, when the page type is still known, but
   page->mem_cgroup is exclusive, i.e. when they are deleted from page
   cache or when their last pte is going away. This would be very
   close to where the VM updates NR_ANON_MAPPED, NR_FILE_PAGES etc.

or

b) Tweak the existing NR_ANON_MAPPED, NR_FILE_PAGES, NR_ANON_THPS
   accounting sites to use the lruvec_page_state infra and get rid of
   the duplicate MEMCG_RSS, MEMCG_CACHE counters completely.

   These sites would need slight adjustments, as they are sometimes
   before commit_charge() set up page->mem_cgroup, but it doesn't look
   too complicated to fix that ordering.

The latter would be a great cleanup, and frankly one that is long
overdue. There is no good reason for all this duplication. We'd not
only get rid of the private counters and the duplicate accounting
sites, it would drastically simplify charging and uncharging, and it
would even obviate the need for a separate kmem (un)charge path.

[ The cgroup1 memcg->kmem thing is the odd-one-out, but I think this
  is complete legacy at this point and nobody is actively setting
  limits on that counter anyway. We can break out an explicit v1-only
  mem_cgroup_charge_legacy_kmem(), put it into the currently accounted
  callsites for compatibility, and not add any new ones. ]
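
For illustration, a rough sketch of what option (b) could look like at one
accounting site (the function name is hypothetical; it assumes
page->mem_cgroup is already set up when the site runs): the existing
NR_ANON_MAPPED update goes through the lruvec_page_state infrastructure,
which bumps both the node vmstat and the per-memcg counter, making a
separate MEMCG_RSS update at charge/uncharge time unnecessary.

	static void account_new_anon_page(struct page *page, int nr_pages)
	{
		/* Updates the node counter and, via page->mem_cgroup,
		 * the matching memcg counter in one call. */
		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);
	}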
Shakeel Butt March 29, 2019, 4 p.m. UTC | #5
On Fri, Mar 29, 2019 at 12:52 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 28-03-19 18:28:36, Shakeel Butt wrote:
> > A VCPU of a VM can allocate upto three pages which can be mmap'ed by the
> > user space application. At the moment this memory is not charged. On a
> > large machine running large number of VMs (or small number of VMs having
> > large number of VCPUs), this unaccounted memory can be very significant.
>
> Is this really the case. How many machines are we talking about? Say I
> have a virtual machines with 1K cpus, this will result in 12MB. Is this
> significant to the overal size of the virtual machine to even care?
>

Think of having ~1K VMs, each with 100s of VCPUs, and the page size can be
larger than 4k. This is not something happening now, but we are moving
in that direction. Also

> > So, this memory should be charged to a kmemcg. However that is not
> > possible as these pages are mmapped to the userspace and PageKmemcg()
> > was designed with the assumption that such pages will never be mmapped
> > to the userspace.
> >
> > One way to solve this problem is by introducing an additional memcg
> > charging API similar to mem_cgroup_[un]charge_skmem(). However skmem
> > charging API usage is contained and shared and no new users are
> > expected but the pages which can be mmapped and should be charged to
> > kmemcg can and will increase. So, requiring the usage for such API will
> > increase the maintenance burden. The simplest solution is to remove the
> > assumption of no mmapping PageKmemcg() pages to user space.
>
> IIRC the only purpose of PageKmemcg is to keep accounting in the legacy
> memcg right. Spending a page flag for that is just no-go.

PageKmemcg is used for both v1 and v2.

> If PageKmemcg
> cannot reuse mapping then we have to find a better place for it (e.g.
> bottom bit in the page->memcg pointer or rethink the whole PageKmemcg.
>

Johannes has made proposals; I will look more into those.

Shakeel

Patch

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 4638303ba6a8..1a9e337ed5da 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -2953,7 +2953,7 @@  struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
 		goto out;
 
 	BUILD_BUG_ON(sizeof(struct sie_page) != 4096);
-	sie_page = (struct sie_page *) get_zeroed_page(GFP_KERNEL);
+	sie_page = (struct sie_page *) get_zeroed_page(GFP_KERNEL_ACCOUNT);
 	if (!sie_page)
 		goto out_free_cpu;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 65e4559eef2f..05c0c7eaa5c6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9019,7 +9019,7 @@  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 	else
 		vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
 
-	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
 		goto fail;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..b47a6a327d6a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -78,6 +78,10 @@ 
  * PG_hwpoison indicates that a page got corrupted in hardware and contains
  * data with incorrect ECC bits that triggered a machine check. Accessing is
  * not safe since it may cause another machine check. Don't touch!
+ *
+ * PG_kmemcg indicates that a kmem page is charged to a memcg. If kmemcg is
+ * enabled, the page allocator will set PageKmemcg() on  pages allocated with
+ * __GFP_ACCOUNT. It gets cleared on page free.
  */
 
 /*
@@ -130,6 +134,9 @@  enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
+#endif
+#ifdef CONFIG_MEMCG_KMEM
+	PG_kmemcg,
 #endif
 	__NR_PAGEFLAGS,
 
@@ -289,6 +296,9 @@  static inline int Page##uname(const struct page *page) { return 0; }
 #define SETPAGEFLAG_NOOP(uname)						\
 static inline void SetPage##uname(struct page *page) {  }
 
+#define __SETPAGEFLAG_NOOP(uname)					\
+static inline void __SetPage##uname(struct page *page) {  }
+
 #define CLEARPAGEFLAG_NOOP(uname)					\
 static inline void ClearPage##uname(struct page *page) {  }
 
@@ -427,6 +437,13 @@  TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_MEMCG_KMEM
+__PAGEFLAG(Kmemcg, kmemcg, PF_NO_TAIL)
+#else
+TESTPAGEFLAG_FALSE(kmemcg)
+__SETPAGEFLAG_NOOP(kmemcg)
+__CLEARPAGEFLAG_NOOP(kmemcg)
+#endif
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
@@ -701,8 +718,7 @@  PAGEFLAG_FALSE(DoubleMap)
 #define PAGE_MAPCOUNT_RESERVE	-128
 #define PG_buddy	0x00000080
 #define PG_offline	0x00000100
-#define PG_kmemcg	0x00000200
-#define PG_table	0x00000400
+#define PG_table	0x00000200
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -743,12 +759,6 @@  PAGE_TYPE_OPS(Buddy, buddy)
  */
 PAGE_TYPE_OPS(Offline, offline)
 
-/*
- * If kmemcg is enabled, the buddy allocator will set PageKmemcg() on
- * pages allocated with __GFP_ACCOUNT. It gets cleared on page free.
- */
-PAGE_TYPE_OPS(Kmemcg, kmemcg)
-
 /*
  * Marks pages in use as page tables.
  */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..d93b78eac5b9 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@ 
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_MEMCG_KMEM
+#define IF_HAVE_PG_KMEMCG(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_KMEMCG(flag,string)
+#endif
+
 #define __def_pageflag_names						\
 	{1UL << PG_locked,		"locked"	},		\
 	{1UL << PG_waiters,		"waiters"	},		\
@@ -105,7 +111,8 @@  IF_HAVE_PG_MLOCK(PG_mlocked,		"mlocked"	)		\
 IF_HAVE_PG_UNCACHED(PG_uncached,	"uncached"	)		\
 IF_HAVE_PG_HWPOISON(PG_hwpoison,	"hwpoison"	)		\
 IF_HAVE_PG_IDLE(PG_young,		"young"		)		\
-IF_HAVE_PG_IDLE(PG_idle,		"idle"		)
+IF_HAVE_PG_IDLE(PG_idle,		"idle"		)		\
+IF_HAVE_PG_KMEMCG(PG_kmemcg,		"kmemcg"	)
 
 #define show_page_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 5294abb3f178..ebf1601de2a5 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -110,7 +110,7 @@  int kvm_coalesced_mmio_init(struct kvm *kvm)
 	int ret;
 
 	ret = -ENOMEM;
-	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!page)
 		goto out_err;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f25aa98a94df..de6328dff251 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -306,7 +306,7 @@  int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pre_pcpu = -1;
 	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
 
-	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!page) {
 		r = -ENOMEM;
 		goto fail;