Message ID | 20220628220938.3657876-2-yosryahmed@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: mm: count KVM mmu usage in memory stats | expand |
On Tue, 28 Jun 2022 23:09:35 +0100, Yosry Ahmed <yosryahmed@google.com> wrote: > > We keep track of several kernel memory stats (total kernel memory, page > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > per-memcg, etc). These stats give insights to users to how much memory > is used by the kernel and for what purposes. > > Currently, memory used by kvm mmu is not accounted in any of those > kernel memory stats. This patch series accounts the memory pages > used by KVM for page tables in those stats in a new > NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account > for other types of secondary pages tables (e.g. iommu page tables). > > KVM has a decent number of large allocations that aren't for page > tables, but for most of them, the number/size of those allocations > scales linearly with either the number of vCPUs or the amount of memory > assigned to the VM. KVM's secondary page table allocations do not scale > linearly, especially when nested virtualization is in use. > > From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's > per-VM pages_{4k,2m,1g} stats unless the guest is doing something > bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is > forced to allocate a large number of page tables even though the guest > isn't accessing that much memory). However, someone would need to either > understand how KVM works to make that connection, or know (or be told) to > go look at KVM's stats if they're running VMs to better decipher the stats. > > Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE > is informative. For example, when backing a VM with THP vs. HugeTLB, > NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order > of magnitude higher with THP. So having this stat will at the very least > prove to be useful for understanding tradeoffs between VM backing types, > and likely even steer folks towards potential optimizations. > > The original discussion with more details about the rationale: > https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org > > This stat will be used by subsequent patches to count KVM mmu > memory usage. > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Marc Zyngier <maz@kernel.org> M.
On Tue, Jun 28, 2022, Yosry Ahmed wrote: > We keep track of several kernel memory stats (total kernel memory, page > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > per-memcg, etc). These stats give insights to users to how much memory > is used by the kernel and for what purposes. > > Currently, memory used by kvm mmu is not accounted in any of those Nit, capitalize KVM (mainly to be consistent). > @@ -1085,6 +1086,9 @@ KernelStack > Memory consumed by the kernel stacks of all tasks > PageTables > Memory consumed by userspace page tables > +SecPageTables > + Memory consumed by secondary page tables, this currently > + currently includes KVM mmu allocations on x86 and arm64. Nit, this line has a tab instead of eight spaces. Not sure if it actually matters, there are plenty of tabs elsewhere in the file, but all the entries in this block use only spaces. > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index aab70355d64f3..13190d298c986 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -216,6 +216,7 @@ enum node_stat_item { > NR_KERNEL_SCS_KB, /* measured in KiB */ > #endif > NR_PAGETABLE, /* used for pagetables */ > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" is messy, so I totally understand why you included it, but in this case it's unnecessary and potentially confusing. And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM (using IS_ENABLED() because KVM can be built as a module)? That could be removed if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for stats the depend on a single feature seems to be the status quo for this code. > #ifdef CONFIG_SWAP > NR_SWAPCACHE, > #endif
Thanks for taking another look at this! On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > We keep track of several kernel memory stats (total kernel memory, page > > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > > per-memcg, etc). These stats give insights to users to how much memory > > is used by the kernel and for what purposes. > > > > Currently, memory used by kvm mmu is not accounted in any of those > > Nit, capitalize KVM (mainly to be consistent). > > > @@ -1085,6 +1086,9 @@ KernelStack > > Memory consumed by the kernel stacks of all tasks > > PageTables > > Memory consumed by userspace page tables > > +SecPageTables > > + Memory consumed by secondary page tables, this currently > > + currently includes KVM mmu allocations on x86 and arm64. > > Nit, this line has a tab instead of eight spaces. Not sure if it actually matters, > there are plenty of tabs elsewhere in the file, but all the entries in this block > use only spaces. > Will fix it. > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index aab70355d64f3..13190d298c986 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -216,6 +216,7 @@ enum node_stat_item { > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > #endif > > NR_PAGETABLE, /* used for pagetables */ > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > is messy, so I totally understand why you included it, but in this case it's unnecessary > and potentially confusing. > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > stats the depend on a single feature seems to be the status quo for this code. > I will #ifdef the stat, but I will emphasize in the docs that is currently *only* used for KVM so that it makes sense if users without KVM don't see the stat at all. I will also remove the stat from show_free_areas() in mm/page_alloc.c as it seems like none of the #ifdefed stats show up there. > > #ifdef CONFIG_SWAP > > NR_SWAPCACHE, > > #endif
On Tue, Jul 12, 2022, Yosry Ahmed wrote: > Thanks for taking another look at this! > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > index aab70355d64f3..13190d298c986 100644 > > > --- a/include/linux/mmzone.h > > > +++ b/include/linux/mmzone.h > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > #endif > > > NR_PAGETABLE, /* used for pagetables */ > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > and potentially confusing. > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > stats the depend on a single feature seems to be the status quo for this code. > > > > I will #ifdef the stat, but I will emphasize in the docs that is > currently *only* used for KVM so that it makes sense if users without > KVM don't see the stat at all. I will also remove the stat from > show_free_areas() in mm/page_alloc.c as it seems like none of the > #ifdefed stats show up there. It's might be worth getting someone from mm/ to weigh in before going through the trouble, my suggestion/question is based purely on the existing code.
On Tue, Jul 12, 2022 at 4:06 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Jul 12, 2022, Yosry Ahmed wrote: > > Thanks for taking another look at this! > > > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > > index aab70355d64f3..13190d298c986 100644 > > > > --- a/include/linux/mmzone.h > > > > +++ b/include/linux/mmzone.h > > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > > #endif > > > > NR_PAGETABLE, /* used for pagetables */ > > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > > and potentially confusing. > > > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > > stats the depend on a single feature seems to be the status quo for this code. > > > > > > > I will #ifdef the stat, but I will emphasize in the docs that is > > currently *only* used for KVM so that it makes sense if users without > > KVM don't see the stat at all. I will also remove the stat from > > show_free_areas() in mm/page_alloc.c as it seems like none of the > > #ifdefed stats show up there. > > It's might be worth getting someone from mm/ to weigh in before going through the > trouble, my suggestion/question is based purely on the existing code. Any mm folks with an opinion about this? Any preference on whether we should wrap NR_SECONDARY_PAGETABLE stats with #ifdef CONFIG_KVM for now as it is currently the only source for this stat?
On Mon, Jul 18, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Tue, Jul 12, 2022 at 4:06 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Tue, Jul 12, 2022, Yosry Ahmed wrote: > > > Thanks for taking another look at this! > > > > > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > > > index aab70355d64f3..13190d298c986 100644 > > > > > --- a/include/linux/mmzone.h > > > > > +++ b/include/linux/mmzone.h > > > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > > > #endif > > > > > NR_PAGETABLE, /* used for pagetables */ > > > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > > > and potentially confusing. > > > > > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > > > stats the depend on a single feature seems to be the status quo for this code. > > > > > > > > > > I will #ifdef the stat, but I will emphasize in the docs that is > > > currently *only* used for KVM so that it makes sense if users without > > > KVM don't see the stat at all. I will also remove the stat from > > > show_free_areas() in mm/page_alloc.c as it seems like none of the > > > #ifdefed stats show up there. > > > > It's might be worth getting someone from mm/ to weigh in before going through the > > trouble, my suggestion/question is based purely on the existing code. > > Any mm folks with an opinion about this? > > Any preference on whether we should wrap NR_SECONDARY_PAGETABLE stats > with #ifdef CONFIG_KVM for now as it is currently the only source for > this stat? Any input here? Johannes, you have been involved in discussions in earlier versions of this series, any thoughts here?
On Mon, Aug 8, 2022 at 1:06 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Jul 18, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Jul 12, 2022 at 4:06 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Tue, Jul 12, 2022, Yosry Ahmed wrote: > > > > Thanks for taking another look at this! > > > > > > > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > > > > index aab70355d64f3..13190d298c986 100644 > > > > > > --- a/include/linux/mmzone.h > > > > > > +++ b/include/linux/mmzone.h > > > > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > > > > #endif > > > > > > NR_PAGETABLE, /* used for pagetables */ > > > > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > > > > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > > > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > > > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > > > > and potentially confusing. > > > > > > > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > > > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > > > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > > > > stats the depend on a single feature seems to be the status quo for this code. > > > > > > > > > > > > > I will #ifdef the stat, but I will emphasize in the docs that is > > > > currently *only* used for KVM so that it makes sense if users without > > > > KVM don't see the stat at all. I will also remove the stat from > > > > show_free_areas() in mm/page_alloc.c as it seems like none of the > > > > #ifdefed stats show up there. > > > > > > It's might be worth getting someone from mm/ to weigh in before going through the > > > trouble, my suggestion/question is based purely on the existing code. > > > > Any mm folks with an opinion about this? > > > > Any preference on whether we should wrap NR_SECONDARY_PAGETABLE stats > > with #ifdef CONFIG_KVM for now as it is currently the only source for > > this stat? > > Any input here? > > Johannes, you have been involved in discussions in earlier versions of > this series, any thoughts here? Andrew, do you have an opinion on this? If not, I will send a v7 with the nits discussed with Sean. I think otherwise this series has sufficient ACKs. Would this be merged through the mm tree or kvm tree? This was based on the kvm/queue branch but I think I can rebase it on top of mm-unstable, I think all dependencies that this would have added in kvm/queue would have been fanned to mm by now.
On Mon, Aug 08, 2022 at 01:06:15PM -0700, Yosry Ahmed wrote: > On Mon, Jul 18, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Jul 12, 2022 at 4:06 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Tue, Jul 12, 2022, Yosry Ahmed wrote: > > > > Thanks for taking another look at this! > > > > > > > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > > > > index aab70355d64f3..13190d298c986 100644 > > > > > > --- a/include/linux/mmzone.h > > > > > > +++ b/include/linux/mmzone.h > > > > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > > > > #endif > > > > > > NR_PAGETABLE, /* used for pagetables */ > > > > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > > > > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > > > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > > > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > > > > and potentially confusing. > > > > > > > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > > > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > > > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > > > > stats the depend on a single feature seems to be the status quo for this code. > > > > > > > > > > > > > I will #ifdef the stat, but I will emphasize in the docs that is > > > > currently *only* used for KVM so that it makes sense if users without > > > > KVM don't see the stat at all. I will also remove the stat from > > > > show_free_areas() in mm/page_alloc.c as it seems like none of the > > > > #ifdefed stats show up there. > > > > > > It's might be worth getting someone from mm/ to weigh in before going through the > > > trouble, my suggestion/question is based purely on the existing code. > > > > Any mm folks with an opinion about this? > > > > Any preference on whether we should wrap NR_SECONDARY_PAGETABLE stats > > with #ifdef CONFIG_KVM for now as it is currently the only source for > > this stat? > > Any input here? > > Johannes, you have been involved in discussions in earlier versions of > this series, any thoughts here? No super strong feelings here. Most major distros have CONFIG_KVM=y/n, so it'll be a common fixture anyway, and the ifdef is proooobably not worth it for hiding it from people. OTOH, the ifdef is useful for documenting the code. If you've already ifdeffed it now, I'd say go ahead with it. Otherwise, don't :) My 2c.
On Mon, Aug 15, 2022 at 8:13 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Aug 08, 2022 at 01:06:15PM -0700, Yosry Ahmed wrote: > > On Mon, Jul 18, 2022 at 11:26 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > On Tue, Jul 12, 2022 at 4:06 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > On Tue, Jul 12, 2022, Yosry Ahmed wrote: > > > > > Thanks for taking another look at this! > > > > > > > > > > On Thu, Jul 7, 2022 at 1:59 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > > > > > > > On Tue, Jun 28, 2022, Yosry Ahmed wrote: > > > > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > > > > > > index aab70355d64f3..13190d298c986 100644 > > > > > > > --- a/include/linux/mmzone.h > > > > > > > +++ b/include/linux/mmzone.h > > > > > > > @@ -216,6 +216,7 @@ enum node_stat_item { > > > > > > > NR_KERNEL_SCS_KB, /* measured in KiB */ > > > > > > > #endif > > > > > > > NR_PAGETABLE, /* used for pagetables */ > > > > > > > + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ > > > > > > > > > > > > Nit, s/kvm/KVM, and drop the "shadow", which might be misinterpreted as saying KVM > > > > > > pagetables are only accounted when KVM is using shadow paging. KVM's usage of "shadow" > > > > > > is messy, so I totally understand why you included it, but in this case it's unnecessary > > > > > > and potentially confusing. > > > > > > > > > > > > And finally, something that's not a nit. Should this be wrapped with CONFIG_KVM > > > > > > (using IS_ENABLED() because KVM can be built as a module)? That could be removed > > > > > > if another non-KVM secondary MMU user comes along, but until then, #ifdeffery for > > > > > > stats the depend on a single feature seems to be the status quo for this code. > > > > > > > > > > > > > > > > I will #ifdef the stat, but I will emphasize in the docs that is > > > > > currently *only* used for KVM so that it makes sense if users without > > > > > KVM don't see the stat at all. I will also remove the stat from > > > > > show_free_areas() in mm/page_alloc.c as it seems like none of the > > > > > #ifdefed stats show up there. > > > > > > > > It's might be worth getting someone from mm/ to weigh in before going through the > > > > trouble, my suggestion/question is based purely on the existing code. > > > > > > Any mm folks with an opinion about this? > > > > > > Any preference on whether we should wrap NR_SECONDARY_PAGETABLE stats > > > with #ifdef CONFIG_KVM for now as it is currently the only source for > > > this stat? > > > > Any input here? > > > > Johannes, you have been involved in discussions in earlier versions of > > this series, any thoughts here? > > No super strong feelings here. Most major distros have CONFIG_KVM=y/n, > so it'll be a common fixture anyway, and the ifdef is proooobably not > worth it for hiding it from people. OTOH, the ifdef is useful for > documenting the code. > > If you've already ifdeffed it now, I'd say go ahead with > it. Otherwise, don't :) My 2c. Thanks a lot, Johannes! I haven't ifdeffed it yet so I'll send a v7 with a few nits and collect ACKs. Andrew, would you prefer me to rebase on top of mm-unstable? Or will this go in through the kvm tree? (currently it's based on an old-ish kvm/queue).
On Tue, 28 Jun 2022 22:09:35 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > We keep track of several kernel memory stats (total kernel memory, page > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > per-memcg, etc). These stats give insights to users to how much memory > is used by the kernel and for what purposes. > > Currently, memory used by kvm mmu is not accounted in any of those > kernel memory stats. This patch series accounts the memory pages > used by KVM for page tables in those stats in a new > NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account > for other types of secondary pages tables (e.g. iommu page tables). > > KVM has a decent number of large allocations that aren't for page > tables, but for most of them, the number/size of those allocations > scales linearly with either the number of vCPUs or the amount of memory > assigned to the VM. KVM's secondary page table allocations do not scale > linearly, especially when nested virtualization is in use. > > >From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's > per-VM pages_{4k,2m,1g} stats unless the guest is doing something > bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is > forced to allocate a large number of page tables even though the guest > isn't accessing that much memory). However, someone would need to either > understand how KVM works to make that connection, or know (or be told) to > go look at KVM's stats if they're running VMs to better decipher the stats. > > Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE > is informative. For example, when backing a VM with THP vs. HugeTLB, > NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order > of magnitude higher with THP. So having this stat will at the very least > prove to be useful for understanding tradeoffs between VM backing types, > and likely even steer folks towards potential optimizations. > > The original discussion with more details about the rationale: > https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org > > This stat will be used by subsequent patches to count KVM mmu > memory usage. Nits and triviata: > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -977,6 +977,7 @@ Example output. You may not have all of these fields. > SUnreclaim: 142336 kB > KernelStack: 11168 kB > PageTables: 20540 kB > + SecPageTables: 0 kB > NFS_Unstable: 0 kB > Bounce: 0 kB > WritebackTmp: 0 kB > @@ -1085,6 +1086,9 @@ KernelStack > Memory consumed by the kernel stacks of all tasks > PageTables > Memory consumed by userspace page tables > +SecPageTables > + Memory consumed by secondary page tables, this currently > + currently includes KVM mmu allocations on x86 and arm64. Something happened to the whitespace there. > + "Node %d SecPageTables: %8lu kB\n" > ... > + nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), The use of "sec" in the user-facing changes and "secondary" in the programmer-facing changes is irksome. Can we be consistent? I'd prefer "secondary" throughout.
On Mon, 15 Aug 2022 08:39:23 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > Thanks a lot, Johannes! I haven't ifdeffed it yet so I'll send a v7 > with a few nits and collect ACKs. Andrew, would you prefer me to > rebase on top of mm-unstable? Or will this go in through the kvm tree? > (currently it's based on an old-ish kvm/queue). Through KVM is OK by me, assuming there'll be ongoing work which is dependent on this.
On Wed, Aug 17, 2022 at 10:24 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Tue, 28 Jun 2022 22:09:35 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > > > We keep track of several kernel memory stats (total kernel memory, page > > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > > per-memcg, etc). These stats give insights to users to how much memory > > is used by the kernel and for what purposes. > > > > Currently, memory used by kvm mmu is not accounted in any of those > > kernel memory stats. This patch series accounts the memory pages > > used by KVM for page tables in those stats in a new > > NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account > > for other types of secondary pages tables (e.g. iommu page tables). > > > > KVM has a decent number of large allocations that aren't for page > > tables, but for most of them, the number/size of those allocations > > scales linearly with either the number of vCPUs or the amount of memory > > assigned to the VM. KVM's secondary page table allocations do not scale > > linearly, especially when nested virtualization is in use. > > > > >From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's > > per-VM pages_{4k,2m,1g} stats unless the guest is doing something > > bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is > > forced to allocate a large number of page tables even though the guest > > isn't accessing that much memory). However, someone would need to either > > understand how KVM works to make that connection, or know (or be told) to > > go look at KVM's stats if they're running VMs to better decipher the stats. > > > > Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE > > is informative. For example, when backing a VM with THP vs. HugeTLB, > > NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order > > of magnitude higher with THP. So having this stat will at the very least > > prove to be useful for understanding tradeoffs between VM backing types, > > and likely even steer folks towards potential optimizations. > > > > The original discussion with more details about the rationale: > > https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org > > > > This stat will be used by subsequent patches to count KVM mmu > > memory usage. > > Nits and triviata: > > > --- a/Documentation/filesystems/proc.rst > > +++ b/Documentation/filesystems/proc.rst > > @@ -977,6 +977,7 @@ Example output. You may not have all of these fields. > > SUnreclaim: 142336 kB > > KernelStack: 11168 kB > > PageTables: 20540 kB > > + SecPageTables: 0 kB > > NFS_Unstable: 0 kB > > Bounce: 0 kB > > WritebackTmp: 0 kB > > @@ -1085,6 +1086,9 @@ KernelStack > > Memory consumed by the kernel stacks of all tasks > > PageTables > > Memory consumed by userspace page tables > > +SecPageTables > > + Memory consumed by secondary page tables, this currently > > + currently includes KVM mmu allocations on x86 and arm64. > > Something happened to the whitespace there. Yeah I have the fix for this queued for v7. Thanks! > > > + "Node %d SecPageTables: %8lu kB\n" > > ... > > + nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), > > The use of "sec" in the user-facing changes and "secondary" in the > programmer-facing changes is irksome. Can we be consistent? I'd > prefer "secondary" throughout. > SecondaryPageTables is too long (unfortunately), it messes up the formatting in node_read_meminfo() and meminfo_proc_show(). I would prefer "secondary" as well, but I don't know if breaking the format in this way is okay. This is what I mean by breaking the format btw (the numbers become misaligned): diff --git a/drivers/base/node.c b/drivers/base/node.c index 5ad56a0cd593..4f85750a0f8e 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -433,7 +433,7 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d ShadowCallStack:%8lu kB\n" #endif "Node %d PageTables: %8lu kB\n" - "Node %d SecPageTables: %8lu kB\n" + "Node %d SecondaryPageTables: %8lu kB\n" "Node %d NFS_Unstable: %8lu kB\n" "Node %d Bounce: %8lu kB\n" "Node %d WritebackTmp: %8lu kB\n" diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 208efd4fa52c..b7166d09a38f 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -115,7 +115,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) #endif show_val_kb(m, "PageTables: ", global_node_page_state(NR_PAGETABLE)); - show_val_kb(m, "SecPageTables: ", + show_val_kb(m, "SecondaryPageTables: ", global_node_page_state(NR_SECONDARY_PAGETABLE)); show_val_kb(m, "NFS_Unstable: ", 0);
On Wed, Aug 17, 2022 at 3:27 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Wed, Aug 17, 2022 at 10:24 AM Andrew Morton > <akpm@linux-foundation.org> wrote: > > > > On Tue, 28 Jun 2022 22:09:35 +0000 Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > We keep track of several kernel memory stats (total kernel memory, page > > > tables, stack, vmalloc, etc) on multiple levels (global, per-node, > > > per-memcg, etc). These stats give insights to users to how much memory > > > is used by the kernel and for what purposes. > > > > > > Currently, memory used by kvm mmu is not accounted in any of those > > > kernel memory stats. This patch series accounts the memory pages > > > used by KVM for page tables in those stats in a new > > > NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account > > > for other types of secondary pages tables (e.g. iommu page tables). > > > > > > KVM has a decent number of large allocations that aren't for page > > > tables, but for most of them, the number/size of those allocations > > > scales linearly with either the number of vCPUs or the amount of memory > > > assigned to the VM. KVM's secondary page table allocations do not scale > > > linearly, especially when nested virtualization is in use. > > > > > > >From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's > > > per-VM pages_{4k,2m,1g} stats unless the guest is doing something > > > bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is > > > forced to allocate a large number of page tables even though the guest > > > isn't accessing that much memory). However, someone would need to either > > > understand how KVM works to make that connection, or know (or be told) to > > > go look at KVM's stats if they're running VMs to better decipher the stats. > > > > > > Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE > > > is informative. For example, when backing a VM with THP vs. HugeTLB, > > > NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order > > > of magnitude higher with THP. So having this stat will at the very least > > > prove to be useful for understanding tradeoffs between VM backing types, > > > and likely even steer folks towards potential optimizations. > > > > > > The original discussion with more details about the rationale: > > > https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org > > > > > > This stat will be used by subsequent patches to count KVM mmu > > > memory usage. > > > > Nits and triviata: > > > > > --- a/Documentation/filesystems/proc.rst > > > +++ b/Documentation/filesystems/proc.rst > > > @@ -977,6 +977,7 @@ Example output. You may not have all of these fields. > > > SUnreclaim: 142336 kB > > > KernelStack: 11168 kB > > > PageTables: 20540 kB > > > + SecPageTables: 0 kB > > > NFS_Unstable: 0 kB > > > Bounce: 0 kB > > > WritebackTmp: 0 kB > > > @@ -1085,6 +1086,9 @@ KernelStack > > > Memory consumed by the kernel stacks of all tasks > > > PageTables > > > Memory consumed by userspace page tables > > > +SecPageTables > > > + Memory consumed by secondary page tables, this currently > > > + currently includes KVM mmu allocations on x86 and arm64. > > > > Something happened to the whitespace there. > > Yeah I have the fix for this queued for v7. Thanks! > > > > > > + "Node %d SecPageTables: %8lu kB\n" > > > ... > > > + nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), > > > > The use of "sec" in the user-facing changes and "secondary" in the > > programmer-facing changes is irksome. Can we be consistent? I'd > > prefer "secondary" throughout. > > > > SecondaryPageTables is too long (unfortunately), it messes up the > formatting in node_read_meminfo() and meminfo_proc_show(). I would > prefer "secondary" as well, but I don't know if breaking the format in > this way is okay. Any thoughts here Andrew? Change to SecondaryPageTables anyway? Change all to use "sec" instead of "secondary"? Leave as-is? > > This is what I mean by breaking the format btw (the numbers become misaligned): > > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 5ad56a0cd593..4f85750a0f8e 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -433,7 +433,7 @@ static ssize_t node_read_meminfo(struct device *dev, > "Node %d ShadowCallStack:%8lu kB\n" > #endif > "Node %d PageTables: %8lu kB\n" > - "Node %d SecPageTables: %8lu kB\n" > + "Node %d SecondaryPageTables: %8lu kB\n" > "Node %d NFS_Unstable: %8lu kB\n" > "Node %d Bounce: %8lu kB\n" > "Node %d WritebackTmp: %8lu kB\n" > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c > index 208efd4fa52c..b7166d09a38f 100644 > --- a/fs/proc/meminfo.c > +++ b/fs/proc/meminfo.c > @@ -115,7 +115,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v) > #endif > show_val_kb(m, "PageTables: ", > global_node_page_state(NR_PAGETABLE)); > - show_val_kb(m, "SecPageTables: ", > + show_val_kb(m, "SecondaryPageTables: ", > global_node_page_state(NR_SECONDARY_PAGETABLE)); > > show_val_kb(m, "NFS_Unstable: ", 0);
On Mon, 22 Aug 2022 17:04:57 -0700 Yosry Ahmed <yosryahmed@google.com> wrote: > > SecondaryPageTables is too long (unfortunately), it messes up the > > formatting in node_read_meminfo() and meminfo_proc_show(). I would > > prefer "secondary" as well, but I don't know if breaking the format in > > this way is okay. > > Any thoughts here Andrew? Change to SecondaryPageTables anyway? Change > all to use "sec" instead of "secondary"? Leave as-is? Leave as-is, I guess.
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 176298f2f4def..e06db032bdbf3 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1340,6 +1340,11 @@ PAGE_SIZE multiple when read back. pagetables Amount of memory allocated for page tables. + sec_pagetables + Amount of memory allocated for secondary page tables, + this currently includes KVM mmu allocations on x86 + and arm64. + percpu (npn) Amount of memory used for storing per-cpu kernel data structures. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 1bc91fb8c321a..aa2a05b585772 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -977,6 +977,7 @@ Example output. You may not have all of these fields. SUnreclaim: 142336 kB KernelStack: 11168 kB PageTables: 20540 kB + SecPageTables: 0 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB @@ -1085,6 +1086,9 @@ KernelStack Memory consumed by the kernel stacks of all tasks PageTables Memory consumed by userspace page tables +SecPageTables + Memory consumed by secondary page tables, this currently + currently includes KVM mmu allocations on x86 and arm64. NFS_Unstable Always zero. Previous counted pages which had been written to the server, but has not been committed to stable storage. diff --git a/drivers/base/node.c b/drivers/base/node.c index 0ac6376ef7a10..5ad56a0cd5937 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -433,6 +433,7 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d ShadowCallStack:%8lu kB\n" #endif "Node %d PageTables: %8lu kB\n" + "Node %d SecPageTables: %8lu kB\n" "Node %d NFS_Unstable: %8lu kB\n" "Node %d Bounce: %8lu kB\n" "Node %d WritebackTmp: %8lu kB\n" @@ -459,6 +460,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif nid, K(node_page_state(pgdat, NR_PAGETABLE)), + nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), nid, 0UL, nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 6e89f0e2fd20f..208efd4fa52c7 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -115,6 +115,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) #endif show_val_kb(m, "PageTables: ", global_node_page_state(NR_PAGETABLE)); + show_val_kb(m, "SecPageTables: ", + global_node_page_state(NR_SECONDARY_PAGETABLE)); show_val_kb(m, "NFS_Unstable: ", 0); show_val_kb(m, "Bounce: ", diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f3..13190d298c986 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -216,6 +216,7 @@ enum node_stat_item { NR_KERNEL_SCS_KB, /* measured in KiB */ #endif NR_PAGETABLE, /* used for pagetables */ + NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. kvm shadow pagetables */ #ifdef CONFIG_SWAP NR_SWAPCACHE, #endif diff --git a/mm/memcontrol.c b/mm/memcontrol.c index abec50f31fe64..d8178395215d4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1394,6 +1394,7 @@ static const struct memory_stat memory_stats[] = { { "kernel", MEMCG_KMEM }, { "kernel_stack", NR_KERNEL_STACK_KB }, { "pagetables", NR_PAGETABLE }, + { "sec_pagetables", NR_SECONDARY_PAGETABLE }, { "percpu", MEMCG_PERCPU_B }, { "sock", MEMCG_SOCK }, { "vmalloc", MEMCG_VMALLOC }, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485c..41ba8942ccee6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5950,7 +5950,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " active_file:%lu inactive_file:%lu isolated_file:%lu\n" " unevictable:%lu dirty:%lu writeback:%lu\n" " slab_reclaimable:%lu slab_unreclaimable:%lu\n" - " mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n" + " mapped:%lu shmem:%lu pagetables:%lu\n" + " sec_pagetables:%lu bounce:%lu\n" " kernel_misc_reclaimable:%lu\n" " free:%lu free_pcp:%lu free_cma:%lu\n", global_node_page_state(NR_ACTIVE_ANON), @@ -5967,6 +5968,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) global_node_page_state(NR_FILE_MAPPED), global_node_page_state(NR_SHMEM), global_node_page_state(NR_PAGETABLE), + global_node_page_state(NR_SECONDARY_PAGETABLE), global_zone_page_state(NR_BOUNCE), global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE), global_zone_page_state(NR_FREE_PAGES), @@ -6000,6 +6002,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) " shadow_call_stack:%lukB" #endif " pagetables:%lukB" + " sec_pagetables:%lukB" " all_unreclaimable? %s" "\n", pgdat->node_id, @@ -6025,6 +6028,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) node_page_state(pgdat, NR_KERNEL_SCS_KB), #endif K(node_page_state(pgdat, NR_PAGETABLE)), + K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)), pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ? "yes" : "no"); } diff --git a/mm/vmstat.c b/mm/vmstat.c index 373d2730fcf21..b937eba681d15 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1240,6 +1240,7 @@ const char * const vmstat_text[] = { "nr_shadow_call_stack", #endif "nr_page_table_pages", + "nr_sec_page_table_pages", #ifdef CONFIG_SWAP "nr_swapcached", #endif