[RFC,v2,11/21] kvm: allocate page table pages from DRAM

Message ID 20181226133351.703380444@intel.com (mailing list archive)
State New, archived
Series PMEM NUMA node and hotness accounting/migration

Commit Message

Fengguang Wu Dec. 26, 2018, 1:14 p.m. UTC
From: Yao Yuan <yuan.yao@intel.com>

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
arch/x86/kvm/mmu.c |   12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Aneesh Kumar K.V Jan. 1, 2019, 9:23 a.m. UTC | #1
Fengguang Wu <fengguang.wu@intel.com> writes:

> From: Yao Yuan <yuan.yao@intel.com>
>
> Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
> arch/x86/kvm/mmu.c |   12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
>  		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
>  }
>  
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +       struct page *page;
> +
> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +       if (!page)
> +	       return 0;
> +       return (unsigned long) page_address(page);
> +}
> +

Maybe it is explained in other patches. What is preventing the
allocation from pmem here? Is it that we are not using the memory
policy's preferred node id, and hence the zonelist we built won't have
the PMEM node?


>  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
>  				       int min)
>  {
> @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
>  	if (cache->nobjs >= min)
>  		return 0;
>  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
>  		if (!page)
>  			return cache->nobjs >= min ? 0 : -ENOMEM;
>  		cache->objects[cache->nobjs++] = page;

-aneesh
Yuan Yao Jan. 2, 2019, 12:59 a.m. UTC | #2
On Tue, Jan 01, 2019 at 02:53:07PM +0530, Aneesh Kumar K.V wrote:
> Fengguang Wu <fengguang.wu@intel.com> writes:
> 
> > From: Yao Yuan <yuan.yao@intel.com>
> >
> > Signed-off-by: Yao Yuan <yuan.yao@intel.com>
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> > ---
> > arch/x86/kvm/mmu.c |   12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > --- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
> > +++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
> > @@ -950,6 +950,16 @@ static void mmu_free_memory_cache(struct
> >  		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
> >  }
> >  
> > +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> > +{
> > +       struct page *page;
> > +
> > +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> > +       if (!page)
> > +	       return 0;
> > +       return (unsigned long) page_address(page);
> > +}
> > +
> 
> Maybe it is explained in other patches. What is preventing the
> allocation from pmem here? Is it that we are not using the memory
> policy's preferred node id, and hence the zonelist we built won't have
> the PMEM node?

That's because the PMEM nodes are memory-only nodes in this patchset,
so numa_node_id() will always return the node id of a DRAM node.

About the zonelist: yes, in patch 10/21 we build the PMEM nodes into a
separate zonelist, so DRAM nodes will not fall back to PMEM nodes.

> 
> >  static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
> >  				       int min)
> >  {
> > @@ -958,7 +968,7 @@ static int mmu_topup_memory_cache_page(s
> >  	if (cache->nobjs >= min)
> >  		return 0;
> >  	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
> > -		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> > +		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
> >  		if (!page)
> >  			return cache->nobjs >= min ? 0 : -ENOMEM;
> >  		cache->objects[cache->nobjs++] = page;
> 
> -aneesh
>
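A minimal sketch of the constraint described above, assuming only standard allocator APIs; the function name is hypothetical and this is not code from the series. It spells out the same "DRAM only" intent explicitly: numa_node_id() returns the node of the running CPU, which in this series is always a DRAM node because the PMEM nodes are memory-only, and __GFP_THISNODE additionally forbids falling back to any other node.

static unsigned long get_dram_page_sketch(gfp_t gfp_mask)
{
	struct page *page;
	/* Node of the running CPU: a DRAM node, since PMEM nodes have no CPUs. */
	int nid = numa_node_id();

	/* __GFP_THISNODE: never fall back to another node's zonelist. */
	page = __alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, 0);
	if (!page)
		return 0;
	return (unsigned long)page_address(page);
}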
Dave Hansen Jan. 2, 2019, 4:47 p.m. UTC | #3
On 12/26/18 5:14 AM, Fengguang Wu wrote:
> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
> +{
> +       struct page *page;
> +
> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
> +       if (!page)
> +	       return 0;
> +       return (unsigned long) page_address(page);
> +}

There seems to be a ton of *policy* baked into these patches.  For
instance: thou shalt not allocate page table pages from PMEM.  That's
surely not a policy we want to inflict on every Linux user until the end
of time.

I think the more important question is how we can have the specific
policy that this patch implements, but also leave open room for other
policies, such as: "I don't care how slow this VM runs, minimize the
amount of fast memory it eats."
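One way to leave that room open, sketched with hypothetical names (the knob and helper below are not from the series; a per-VM control or a memory policy would be equally valid choices): a boolean parameter decides whether page-table pages are pinned to the local DRAM node or allocated normally so they can follow whatever policy the task already has.

static bool pgtable_dram_only = true;	/* hypothetical policy knob */
module_param(pgtable_dram_only, bool, 0644);

static unsigned long kvm_pgtable_page_sketch(gfp_t gfp_mask)
{
	struct page *page;

	if (pgtable_dram_only)
		/* Pin to the local (DRAM) node, as in this patch. */
		page = __alloc_pages_node(numa_node_id(),
					  gfp_mask | __GFP_THISNODE, 0);
	else
		/* Let the task's mempolicy decide, PMEM included. */
		page = alloc_page(gfp_mask);

	if (!page)
		return 0;
	return (unsigned long)page_address(page);
}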
Fengguang Wu Jan. 7, 2019, 10:21 a.m. UTC | #4
On Wed, Jan 02, 2019 at 08:47:25AM -0800, Dave Hansen wrote:
>On 12/26/18 5:14 AM, Fengguang Wu wrote:
>> +static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
>> +{
>> +       struct page *page;
>> +
>> +       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
>> +       if (!page)
>> +	       return 0;
>> +       return (unsigned long) page_address(page);
>> +}
>
>There seems to be a ton of *policy* baked into these patches.  For
>instance: thou shalt not allocate page table pages from PMEM.  That's
>surely not a policy we want to inflict on every Linux user until the end
>of time.

Right. It's a straightforward policy for users that care about performance.
The project is planned in 3 steps; at this moment we are in phase (1):

1) core functionalities, easy to backport
2) upstream-able total solution
3) upstream when API stabilized

The dumb kernel interface /proc/PID/idle_pages enables implementing
the majority of policies in user space. However, for the remaining
smaller parts, it looks easier to implement an obvious policy first,
and then to consider more possibilities.

>I think the more important question is how we can have the specific
>policy that this patch implements, but also leave open room for other
>policies, such as: "I don't care how slow this VM runs, minimize the
>amount of fast memory it eats."

Agreed. I'm open to other approaches. We can treat these patches as a
version that solicits feedback: if anyone sends reasonable improvements,
or even a totally different way of doing it, I'd be happy to
incorporate them.

Thanks,
Fengguang

Patch

--- linux.orig/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.846720344 +0800
+++ linux/arch/x86/kvm/mmu.c	2018-12-26 20:54:48.842719614 +0800
@@ -950,6 +950,16 @@  static void mmu_free_memory_cache(struct
 		kmem_cache_free(cache, mc->objects[--mc->nobjs]);
 }
 
+static unsigned long __get_dram_free_pages(gfp_t gfp_mask)
+{
+       struct page *page;
+
+       page = __alloc_pages(GFP_KERNEL_ACCOUNT, 0, numa_node_id());
+       if (!page)
+	       return 0;
+       return (unsigned long) page_address(page);
+}
+
 static int mmu_topup_memory_cache_page(struct kvm_mmu_memory_cache *cache,
 				       int min)
 {
@@ -958,7 +968,7 @@  static int mmu_topup_memory_cache_page(s
 	if (cache->nobjs >= min)
 		return 0;
 	while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
-		page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
+		page = (void *)__get_dram_free_pages(GFP_KERNEL_ACCOUNT);
 		if (!page)
 			return cache->nobjs >= min ? 0 : -ENOMEM;
 		cache->objects[cache->nobjs++] = page;