
[RFC,3/9] x86/mm/cpa: Add grouped page allocations

Message ID 20210505003032.489164-4-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series PKS write protected page tables

Commit Message

Edgecombe, Rick P May 5, 2021, 12:30 a.m. UTC
For x86, setting memory permissions on the direct map results in fracturing
large pages. Direct map fracturing can be reduced by locating pages that
will have their permissions set close together.

Create a simple page cache that allocates pages from huge page size
blocks. Don't guarantee that a page will come from a huge page grouping;
instead, fall back to non-grouped pages to fulfill the allocation if
needed. Also, register a shrinker so that the system can ask for the
pages back if needed. Since this is only needed when there is a direct
map, compile it out on highmem systems.

Free pages in the cache are tracked in per-node lists inside a
list_lru. NUMA_NO_NODE requests are serviced by checking each per-node
list in a round-robin fashion. If pages are requested for a certain node
but the cache is empty for that node, a whole additional huge page size
block is allocated.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
 arch/x86/include/asm/set_memory.h |  14 +++
 arch/x86/mm/pat/set_memory.c      | 151 ++++++++++++++++++++++++++++++
 2 files changed, 165 insertions(+)
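
As a rough sketch of how a user of this new interface might look (the
example_* names and GFP flags below are illustrative only and do not
appear in this series):

	#include <asm/set_memory.h>

	/* Illustrative only: a subsystem-private cache of grouped pages. */
	static struct grouped_page_cache example_gpc;

	static int __init example_cache_init(void)
	{
		/* GFP_KERNEL | __GFP_ZERO is just an example gfp mask. */
		return init_grouped_page_cache(&example_gpc, GFP_KERNEL | __GFP_ZERO);
	}

	static struct page *example_alloc(int node)
	{
		/*
		 * Hands back a 4k page, preferably carved out of a huge page
		 * size block on 'node'; the cache falls back to an ordinary
		 * allocation if no such block is available.
		 */
		return get_grouped_page(node, &example_gpc);
	}

	static void example_free(struct page *page)
	{
		/*
		 * Returns the page to the cache's per-node list_lru instead
		 * of the page allocator; the registered shrinker can give it
		 * back to the system under memory pressure.
		 */
		free_grouped_page(&example_gpc, page);
	}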

Comments

Mike Rapoport May 5, 2021, 12:08 p.m. UTC | #1
On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe wrote:
> For x86, setting memory permissions on the direct map results in fracturing
> large pages. Direct map fracturing can be reduced by locating pages that
> will have their permissions set close together.
> 
> Create a simple page cache that allocates pages from huge page size
> blocks. Don't guarantee that a page will come from a huge page grouping,
> instead fallback to non-grouped pages to fulfill the allocation if
> needed. Also, register a shrinker such that the system can ask for the
> pages back if needed. Since this is only needed when there is a direct
> map, compile it out on highmem systems.

I only had time to skim through the patches, but I like the idea of having a
simple cache that allocates larger pages with a fallback to the basic page
size.

I just think it should be more generic and closer to the page allocator.
I was thinking about adding a GFP flag that would indicate that the allocated
pages should be removed from the direct map. Then alloc_pages() could use
such a cache whenever this GFP flag is specified, with a fallback to lower
order allocations.
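
Purely as an illustration of that idea (the __GFP_UNMAPPED flag and the
unmapped_cache_alloc() helper below are invented names for this sketch,
not existing kernel APIs):

	static struct page *alloc_pages_unmapped(int nid, gfp_t gfp, unsigned int order)
	{
		struct page *page;
		unsigned int i;

		/*
		 * Hypothetical: first try a cache of pages carved out of huge
		 * page size blocks that were already removed from the direct
		 * map, so that unmapped pages stay grouped together.
		 */
		page = unmapped_cache_alloc(nid, gfp, order);
		if (page)
			return page;

		/* Fall back to an ordinary lower order allocation ... */
		page = alloc_pages_node(nid, gfp & ~__GFP_UNMAPPED, order);
		if (!page)
			return NULL;

		/*
		 * ... and pull it out of the direct map afterwards (a TLB
		 * flush would also be needed; omitted here for brevity).
		 */
		for (i = 0; i < (1U << order); i++)
			set_direct_map_invalid_noflush(page + i);

		return page;
	}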
 
> Free pages in the cache are kept track of in per-node list inside a
> list_lru. NUMA_NO_NODE requests are serviced by checking each per-node
> list in a round robin fashion. If pages are requested for a certain node
> but the cache is empty for that node, a whole additional huge page size
> page is allocated.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
>  arch/x86/include/asm/set_memory.h |  14 +++
>  arch/x86/mm/pat/set_memory.c      | 151 ++++++++++++++++++++++++++++++
>  2 files changed, 165 insertions(+)
> 
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 4352f08bfbb5..b63f09cc282a 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -4,6 +4,9 @@
>  
>  #include <asm/page.h>
>  #include <asm-generic/set_memory.h>
> +#include <linux/gfp.h>
> +#include <linux/list_lru.h>
> +#include <linux/shrinker.h>
>  
>  /*
>   * The set_memory_* API can be used to change various attributes of a virtual
> @@ -135,4 +138,15 @@ static inline int clear_mce_nospec(unsigned long pfn)
>   */
>  #endif
>  
> +struct grouped_page_cache {
> +	struct shrinker shrinker;
> +	struct list_lru lru;
> +	gfp_t gfp;
> +	atomic_t nid_round_robin;
> +};
> +
> +int init_grouped_page_cache(struct grouped_page_cache *gpc, gfp_t gfp);
> +struct page *get_grouped_page(int node, struct grouped_page_cache *gpc);
> +void free_grouped_page(struct grouped_page_cache *gpc, struct page *page);
> +
>  #endif /* _ASM_X86_SET_MEMORY_H */
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 16f878c26667..6877ef66793b 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -2306,6 +2306,157 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
>  	return retval;
>  }
>  
> +#ifndef CONFIG_HIGHMEM
> +static struct page *__alloc_page_order(int node, gfp_t gfp_mask, int order)
> +{
> +	if (node == NUMA_NO_NODE)
> +		return alloc_pages(gfp_mask, order);
> +
> +	return alloc_pages_node(node, gfp_mask, order);
> +}
> +
> +static struct grouped_page_cache *__get_gpc_from_sc(struct shrinker *shrinker)
> +{
> +	return container_of(shrinker, struct grouped_page_cache, shrinker);
> +}
> +
> +static unsigned long grouped_shrink_count(struct shrinker *shrinker,
> +					  struct shrink_control *sc)
> +{
> +	struct grouped_page_cache *gpc = __get_gpc_from_sc(shrinker);
> +	unsigned long page_cnt = list_lru_shrink_count(&gpc->lru, sc);
> +
> +	return page_cnt ? page_cnt : SHRINK_EMPTY;
> +}
> +
> +static enum lru_status grouped_isolate(struct list_head *item,
> +				       struct list_lru_one *list,
> +				       spinlock_t *lock, void *cb_arg)
> +{
> +	struct list_head *dispose = cb_arg;
> +
> +	list_lru_isolate_move(list, item, dispose);
> +
> +	return LRU_REMOVED;
> +}
> +
> +static void __dispose_pages(struct grouped_page_cache *gpc, struct list_head *head)
> +{
> +	struct list_head *cur, *next;
> +
> +	list_for_each_safe(cur, next, head) {
> +		struct page *page = list_entry(cur, struct page, lru);
> +
> +		list_del(cur);
> +
> +		__free_pages(page, 0);
> +	}
> +}
> +
> +static unsigned long grouped_shrink_scan(struct shrinker *shrinker,
> +					 struct shrink_control *sc)
> +{
> +	struct grouped_page_cache *gpc = __get_gpc_from_sc(shrinker);
> +	unsigned long isolated;
> +	LIST_HEAD(freeable);
> +
> +	if (!(sc->gfp_mask & gpc->gfp))
> +		return SHRINK_STOP;
> +
> +	isolated = list_lru_shrink_walk(&gpc->lru, sc, grouped_isolate,
> +					&freeable);
> +	__dispose_pages(gpc, &freeable);
> +
> +	/* Every item walked gets isolated */
> +	sc->nr_scanned += isolated;
> +
> +	return isolated;
> +}
> +
> +static struct page *__remove_first_page(struct grouped_page_cache *gpc, int node)
> +{
> +	unsigned int start_nid, i;
> +	struct list_head *head;
> +
> +	if (node != NUMA_NO_NODE) {
> +		head = list_lru_get_mru(&gpc->lru, node);
> +		if (head)
> +			return list_entry(head, struct page, lru);
> +		return NULL;
> +	}
> +
> +	/* If NUMA_NO_NODE, search the nodes in round robin for a page */
> +	start_nid = (unsigned int)atomic_fetch_inc(&gpc->nid_round_robin) % nr_node_ids;
> +	for (i = 0; i < nr_node_ids; i++) {
> +		int cur_nid = (start_nid + i) % nr_node_ids;
> +
> +		head = list_lru_get_mru(&gpc->lru, cur_nid);
> +		if (head)
> +			return list_entry(head, struct page, lru);
> +	}
> +
> +	return NULL;
> +}
> +
> +/* Allocate a huge page size block, add the extra pages to the cache and return one */
> +static struct page *__replenish_grouped_pages(struct grouped_page_cache *gpc, int node)
> +{
> +	const unsigned int hpage_cnt = HPAGE_SIZE >> PAGE_SHIFT;
> +	struct page *page;
> +	int i;
> +
> +	page = __alloc_page_order(node, gpc->gfp, HUGETLB_PAGE_ORDER);
> +	if (!page)
> +		return __alloc_page_order(node, gpc->gfp, 0);
> +
> +	split_page(page, HUGETLB_PAGE_ORDER);
> +
> +	for (i = 1; i < hpage_cnt; i++)
> +		free_grouped_page(gpc, &page[i]);
> +
> +	return &page[0];
> +}
> +
> +int init_grouped_page_cache(struct grouped_page_cache *gpc, gfp_t gfp)
> +{
> +	int err = -ENOMEM;
> +
> +	memset(gpc, 0, sizeof(struct grouped_page_cache));
> +
> +	if (list_lru_init(&gpc->lru))
> +		goto out;
> +
> +	gpc->shrinker.count_objects = grouped_shrink_count;
> +	gpc->shrinker.scan_objects = grouped_shrink_scan;
> +	gpc->shrinker.seeks = DEFAULT_SEEKS;
> +	gpc->shrinker.flags = SHRINKER_NUMA_AWARE;
> +
> +	err = register_shrinker(&gpc->shrinker);
> +	if (err)
> +		list_lru_destroy(&gpc->lru);
> +
> +out:
> +	return err;
> +}
> +
> +struct page *get_grouped_page(int node, struct grouped_page_cache *gpc)
> +{
> +	struct page *page;
> +
> +	page = __remove_first_page(gpc, node);
> +
> +	if (page)
> +		return page;
> +
> +	return __replenish_grouped_pages(gpc, node);
> +}
> +
> +void free_grouped_page(struct grouped_page_cache *gpc, struct page *page)
> +{
> +	INIT_LIST_HEAD(&page->lru);
> +	list_lru_add_node(&gpc->lru, &page->lru, page_to_nid(page));
> +}
> +#endif /* !CONFIG_HIGHMEM */
>  /*
>   * The testcases use internal knowledge of the implementation that shouldn't
>   * be exposed to the rest of the kernel. Include these directly here.
> -- 
> 2.30.2
>
Peter Zijlstra May 5, 2021, 1:09 p.m. UTC | #2
On Wed, May 05, 2021 at 03:08:27PM +0300, Mike Rapoport wrote:
> On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe wrote:
> > For x86, setting memory permissions on the direct map results in fracturing
> > large pages. Direct map fracturing can be reduced by locating pages that
> > will have their permissions set close together.
> > 
> > Create a simple page cache that allocates pages from huge page size
> > blocks. Don't guarantee that a page will come from a huge page grouping,
> > instead fallback to non-grouped pages to fulfill the allocation if
> > needed. Also, register a shrinker such that the system can ask for the
> > pages back if needed. Since this is only needed when there is a direct
> > map, compile it out on highmem systems.
> 
> I only had time to skim through the patches, I like the idea of having a
> simple cache that allocates larger pages with a fallback to basic page
> size.
> 
> I just think it should be more generic and closer to the page allocator.
> I was thinking about adding a GFP flag that will tell that the allocated
> pages should be removed from the direct map. Then alloc_pages() could use
> such cache whenever this GFP flag is specified with a fallback for lower
> order allocations.

That doesn't provide enough information, I think. Removing from the direct
map isn't the only consideration; you also want to group them by the
target protection bits such that we don't get to use 4k pages quite so
much.
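
(Purely for illustration, reusing the interface from this patch rather than
a page allocator change: one could imagine a separate cache per target
protection, so pages destined for the same protection tend to come from the
same huge page size blocks. The cache names below are made up.)

	/* Illustrative only: group-by-protection via separate caches. */
	static struct grouped_page_cache gpc_ro;	/* pages that will become read-only */
	static struct grouped_page_cache gpc_nx;	/* pages that will become non-executable */

	static struct page *alloc_future_ro_page(int node)
	{
		/* Future read-only pages come from the same 2M blocks. */
		return get_grouped_page(node, &gpc_ro);
	}
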
Mike Rapoport May 5, 2021, 6:45 p.m. UTC | #3
On Wed, May 05, 2021 at 03:09:12PM +0200, Peter Zijlstra wrote:
> On Wed, May 05, 2021 at 03:08:27PM +0300, Mike Rapoport wrote:
> > On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe wrote:
> > > For x86, setting memory permissions on the direct map results in fracturing
> > > large pages. Direct map fracturing can be reduced by locating pages that
> > > will have their permissions set close together.
> > > 
> > > Create a simple page cache that allocates pages from huge page size
> > > blocks. Don't guarantee that a page will come from a huge page grouping,
> > > instead fallback to non-grouped pages to fulfill the allocation if
> > > needed. Also, register a shrinker such that the system can ask for the
> > > pages back if needed. Since this is only needed when there is a direct
> > > map, compile it out on highmem systems.
> > 
> > I only had time to skim through the patches, I like the idea of having a
> > simple cache that allocates larger pages with a fallback to basic page
> > size.
> > 
> > I just think it should be more generic and closer to the page allocator.
> > I was thinking about adding a GFP flag that will tell that the allocated
> > pages should be removed from the direct map. Then alloc_pages() could use
> > such cache whenever this GFP flag is specified with a fallback for lower
> > order allocations.
> 
> That doesn't provide enough information I think. Removing from direct
> map isn't the only consideration, you also want to group them by the
> target protection bits such that we don't get to use 4k pages quite so
> much.

Unless I'm missing something, we hand out 4k pages from the cache anyway, and
the neighbouring 4k page may end up with different protections.

This is also similar to what happens in the set Rick posted a while ago to
support grouped vmalloc allocations:

[1] https://lore.kernel.org/lkml/20210405203711.1095940-1-rick.p.edgecombe@intel.com/
Edgecombe, Rick P May 5, 2021, 9:57 p.m. UTC | #4
On Wed, 2021-05-05 at 21:45 +0300, Mike Rapoport wrote:
> On Wed, May 05, 2021 at 03:09:12PM +0200, Peter Zijlstra wrote:
> > On Wed, May 05, 2021 at 03:08:27PM +0300, Mike Rapoport wrote:
> > > On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe wrote:
> > > > For x86, setting memory permissions on the direct map results
> > > > in fracturing
> > > > large pages. Direct map fracturing can be reduced by locating
> > > > pages that
> > > > will have their permissions set close together.
> > > > 
> > > > Create a simple page cache that allocates pages from huge page
> > > > size
> > > > blocks. Don't guarantee that a page will come from a huge page
> > > > grouping,
> > > > instead fallback to non-grouped pages to fulfill the allocation
> > > > if
> > > > needed. Also, register a shrinker such that the system can ask
> > > > for the
> > > > pages back if needed. Since this is only needed when there is a
> > > > direct
> > > > map, compile it out on highmem systems.
> > > 
> > > I only had time to skim through the patches, I like the idea of
> > > having a
> > > simple cache that allocates larger pages with a fallback to basic
> > > page
> > > size.
> > > 
> > > I just think it should be more generic and closer to the page
> > > allocator.
> > > I was thinking about adding a GFP flag that will tell that the
> > > allocated
> > > pages should be removed from the direct map. Then alloc_pages()
> > > could use
> > > such cache whenever this GFP flag is specified with a fallback
> > > for lower
> > > order allocations.
> > 
> > That doesn't provide enough information I think. Removing from
> > direct
> > map isn't the only consideration, you also want to group them by
> > the
> > target protection bits such that we don't get to use 4k pages quite
> > so
> > much.
> 
> Unless I'm missing something we anyway hand out 4k pages from the
> cache and
> the neighbouring 4k may end up with different protections.
> 
> This is also similar to what happens in the set Rick posted a while
> ago to
> support grouped vmalloc allocations:
> 

One issue is with the shrinker callbacks. If you are just trying to
reset and free a single page because the system is low on memory, it
could be problematic to have to break a large page, which would require
another page.

I think for vmalloc, eventually it should just have the direct map
alias unmapped. The reason it was not in the linked patch is just to
move iteratively in the direction of having permissioned vmallocs be
unmapped.
Mike Rapoport May 9, 2021, 9:39 a.m. UTC | #5
On Wed, May 05, 2021 at 09:57:17PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2021-05-05 at 21:45 +0300, Mike Rapoport wrote:
> > On Wed, May 05, 2021 at 03:09:12PM +0200, Peter Zijlstra wrote:
> > > On Wed, May 05, 2021 at 03:08:27PM +0300, Mike Rapoport wrote:
> > > > On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe wrote:
> > > > > For x86, setting memory permissions on the direct map results
> > > > > in fracturing
> > > > > large pages. Direct map fracturing can be reduced by locating
> > > > > pages that
> > > > > will have their permissions set close together.
> > > > > 
> > > > > Create a simple page cache that allocates pages from huge page
> > > > > size
> > > > > blocks. Don't guarantee that a page will come from a huge page
> > > > > grouping,
> > > > > instead fallback to non-grouped pages to fulfill the allocation
> > > > > if
> > > > > needed. Also, register a shrinker such that the system can ask
> > > > > for the
> > > > > pages back if needed. Since this is only needed when there is a
> > > > > direct
> > > > > map, compile it out on highmem systems.
> > > > 
> > > > I only had time to skim through the patches, I like the idea of
> > > > having a
> > > > simple cache that allocates larger pages with a fallback to basic
> > > > page
> > > > size.
> > > > 
> > > > I just think it should be more generic and closer to the page
> > > > allocator.
> > > > I was thinking about adding a GFP flag that will tell that the
> > > > allocated
> > > > pages should be removed from the direct map. Then alloc_pages()
> > > > could use
> > > > such cache whenever this GFP flag is specified with a fallback
> > > > for lower
> > > > order allocations.
> > > 
> > > That doesn't provide enough information I think. Removing from
> > > direct
> > > map isn't the only consideration, you also want to group them by
> > > the
> > > target protection bits such that we don't get to use 4k pages quite
> > > so
> > > much.
> > 
> > Unless I'm missing something we anyway hand out 4k pages from the
> > cache and
> > the neighbouring 4k may end up with different protections.
> > 
> > This is also similar to what happens in the set Rick posted a while
> > ago to
> > support grouped vmalloc allocations:
> > 
> 
> One issue is with the shrinker callbacks. If you are just trying to
> reset and free a single page because the system is low on memory, it
> could be problematic to have to break a large page, which would require
> another page.

I don't follow you here. Maybe I've misread the patches, but AFAIU the large
page is broken at allocation time and 4k pages remain 4k pages afterwards.

In my understanding the problem with a simple shrinker is that even if we
have the entire 2M free, it is not reinstated as a 2M page in the direct
mapping.
Edgecombe, Rick P May 10, 2021, 7:38 p.m. UTC | #6
On Sun, 2021-05-09 at 12:39 +0300, Mike Rapoport wrote:
> On Wed, May 05, 2021 at 09:57:17PM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2021-05-05 at 21:45 +0300, Mike Rapoport wrote:
> > > On Wed, May 05, 2021 at 03:09:12PM +0200, Peter Zijlstra wrote:
> > > > On Wed, May 05, 2021 at 03:08:27PM +0300, Mike Rapoport wrote:
> > > > > On Tue, May 04, 2021 at 05:30:26PM -0700, Rick Edgecombe
> > > > > wrote:
> > > > > > For x86, setting memory permissions on the direct map
> > > > > > results
> > > > > > in fracturing
> > > > > > large pages. Direct map fracturing can be reduced by
> > > > > > locating
> > > > > > pages that
> > > > > > will have their permissions set close together.
> > > > > > 
> > > > > > Create a simple page cache that allocates pages from huge
> > > > > > page
> > > > > > size
> > > > > > blocks. Don't guarantee that a page will come from a huge
> > > > > > page
> > > > > > grouping,
> > > > > > instead fallback to non-grouped pages to fulfill the
> > > > > > allocation
> > > > > > if
> > > > > > needed. Also, register a shrinker such that the system can
> > > > > > ask
> > > > > > for the
> > > > > > pages back if needed. Since this is only needed when there
> > > > > > is a
> > > > > > direct
> > > > > > map, compile it out on highmem systems.
> > > > > 
> > > > > I only had time to skim through the patches, I like the idea
> > > > > of
> > > > > having a
> > > > > simple cache that allocates larger pages with a fallback to
> > > > > basic
> > > > > page
> > > > > size.
> > > > > 
> > > > > I just think it should be more generic and closer to the page
> > > > > allocator.
> > > > > I was thinking about adding a GFP flag that will tell that
> > > > > the
> > > > > allocated
> > > > > pages should be removed from the direct map. Then
> > > > > alloc_pages()
> > > > > could use
> > > > > such cache whenever this GFP flag is specified with a
> > > > > fallback
> > > > > for lower
> > > > > order allocations.
> > > > 
> > > > That doesn't provide enough information I think. Removing from
> > > > direct
> > > > map isn't the only consideration, you also want to group them
> > > > by
> > > > the
> > > > target protection bits such that we don't get to use 4k pages
> > > > quite
> > > > so
> > > > much.
> > > 
> > > Unless I'm missing something we anyway hand out 4k pages from the
> > > cache and
> > > the neighbouring 4k may end up with different protections.
> > > 
> > > This is also similar to what happens in the set Rick posted a
> > > while
> > > ago to
> > > support grouped vmalloc allocations:
> > > 
> > 
> > One issue is with the shrinker callbacks. If you are just trying to
> > reset and free a single page because the system is low on memory,
> > it
> > could be problematic to have to break a large page, which would
> > require
> > another page.
> 
> I don't follow you here. Maybe I've misread the patches but AFAIU the
> large
> page is broken at allocation time and 4k pages remain 4k pages
> afterwards.

Yea that's right.

I thought Peter was saying that if the page allocator grouped all pages
with the same permission together, it could often leave the direct map as
large pages, and so the page allocator would have to know about permissions.

So I was just trying to say that, to leave large pages on the direct map,
the shrinker has to handle breaking a large page while freeing a single page.
That would have to be addressed to get large pages with permissions
in the first place.

It doesn't seem impossible to solve I guess, so maybe not an important
point. It could maybe just hold a page in reserve.

Now that I think about it, since this PKS tables series holds all
potentially needed direct map page tables in reserve, it shouldn't
actually be a problem for this case. So this could leave the PKS tables
pages as large on the direct map.

> In my understanding the problem with a simple shrinker is that even
> if we
> have the entire 2M free it is not being reinstated as 2M page in the
> direct
> mapping.

Yea, that is a downside to this simple shrinker.

Patch

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4352f08bfbb5..b63f09cc282a 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -4,6 +4,9 @@ 
 
 #include <asm/page.h>
 #include <asm-generic/set_memory.h>
+#include <linux/gfp.h>
+#include <linux/list_lru.h>
+#include <linux/shrinker.h>
 
 /*
  * The set_memory_* API can be used to change various attributes of a virtual
@@ -135,4 +138,15 @@  static inline int clear_mce_nospec(unsigned long pfn)
  */
 #endif
 
+struct grouped_page_cache {
+	struct shrinker shrinker;
+	struct list_lru lru;
+	gfp_t gfp;
+	atomic_t nid_round_robin;
+};
+
+int init_grouped_page_cache(struct grouped_page_cache *gpc, gfp_t gfp);
+struct page *get_grouped_page(int node, struct grouped_page_cache *gpc);
+void free_grouped_page(struct grouped_page_cache *gpc, struct page *page);
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..6877ef66793b 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2306,6 +2306,157 @@  int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
 	return retval;
 }
 
+#ifndef CONFIG_HIGHMEM
+static struct page *__alloc_page_order(int node, gfp_t gfp_mask, int order)
+{
+	if (node == NUMA_NO_NODE)
+		return alloc_pages(gfp_mask, order);
+
+	return alloc_pages_node(node, gfp_mask, order);
+}
+
+static struct grouped_page_cache *__get_gpc_from_sc(struct shrinker *shrinker)
+{
+	return container_of(shrinker, struct grouped_page_cache, shrinker);
+}
+
+static unsigned long grouped_shrink_count(struct shrinker *shrinker,
+					  struct shrink_control *sc)
+{
+	struct grouped_page_cache *gpc = __get_gpc_from_sc(shrinker);
+	unsigned long page_cnt = list_lru_shrink_count(&gpc->lru, sc);
+
+	return page_cnt ? page_cnt : SHRINK_EMPTY;
+}
+
+static enum lru_status grouped_isolate(struct list_head *item,
+				       struct list_lru_one *list,
+				       spinlock_t *lock, void *cb_arg)
+{
+	struct list_head *dispose = cb_arg;
+
+	list_lru_isolate_move(list, item, dispose);
+
+	return LRU_REMOVED;
+}
+
+static void __dispose_pages(struct grouped_page_cache *gpc, struct list_head *head)
+{
+	struct list_head *cur, *next;
+
+	list_for_each_safe(cur, next, head) {
+		struct page *page = list_entry(cur, struct page, lru);
+
+		list_del(cur);
+
+		__free_pages(page, 0);
+	}
+}
+
+static unsigned long grouped_shrink_scan(struct shrinker *shrinker,
+					 struct shrink_control *sc)
+{
+	struct grouped_page_cache *gpc = __get_gpc_from_sc(shrinker);
+	unsigned long isolated;
+	LIST_HEAD(freeable);
+
+	if (!(sc->gfp_mask & gpc->gfp))
+		return SHRINK_STOP;
+
+	isolated = list_lru_shrink_walk(&gpc->lru, sc, grouped_isolate,
+					&freeable);
+	__dispose_pages(gpc, &freeable);
+
+	/* Every item walked gets isolated */
+	sc->nr_scanned += isolated;
+
+	return isolated;
+}
+
+static struct page *__remove_first_page(struct grouped_page_cache *gpc, int node)
+{
+	unsigned int start_nid, i;
+	struct list_head *head;
+
+	if (node != NUMA_NO_NODE) {
+		head = list_lru_get_mru(&gpc->lru, node);
+		if (head)
+			return list_entry(head, struct page, lru);
+		return NULL;
+	}
+
+	/* If NUMA_NO_NODE, search the nodes in round robin for a page */
+	start_nid = (unsigned int)atomic_fetch_inc(&gpc->nid_round_robin) % nr_node_ids;
+	for (i = 0; i < nr_node_ids; i++) {
+		int cur_nid = (start_nid + i) % nr_node_ids;
+
+		head = list_lru_get_mru(&gpc->lru, cur_nid);
+		if (head)
+			return list_entry(head, struct page, lru);
+	}
+
+	return NULL;
+}
+
+/* Allocate a huge page size block, add the extra pages to the cache and return one */
+static struct page *__replenish_grouped_pages(struct grouped_page_cache *gpc, int node)
+{
+	const unsigned int hpage_cnt = HPAGE_SIZE >> PAGE_SHIFT;
+	struct page *page;
+	int i;
+
+	page = __alloc_page_order(node, gpc->gfp, HUGETLB_PAGE_ORDER);
+	if (!page)
+		return __alloc_page_order(node, gpc->gfp, 0);
+
+	split_page(page, HUGETLB_PAGE_ORDER);
+
+	for (i = 1; i < hpage_cnt; i++)
+		free_grouped_page(gpc, &page[i]);
+
+	return &page[0];
+}
+
+int init_grouped_page_cache(struct grouped_page_cache *gpc, gfp_t gfp)
+{
+	int err = -ENOMEM;
+
+	memset(gpc, 0, sizeof(struct grouped_page_cache));
+
+	if (list_lru_init(&gpc->lru))
+		goto out;
+
+	gpc->shrinker.count_objects = grouped_shrink_count;
+	gpc->shrinker.scan_objects = grouped_shrink_scan;
+	gpc->shrinker.seeks = DEFAULT_SEEKS;
+	gpc->shrinker.flags = SHRINKER_NUMA_AWARE;
+
+	err = register_shrinker(&gpc->shrinker);
+	if (err)
+		list_lru_destroy(&gpc->lru);
+
+out:
+	return err;
+}
+
+struct page *get_grouped_page(int node, struct grouped_page_cache *gpc)
+{
+	struct page *page;
+
+	page = __remove_first_page(gpc, node);
+
+	if (page)
+		return page;
+
+	return __replenish_grouped_pages(gpc, node);
+}
+
+void free_grouped_page(struct grouped_page_cache *gpc, struct page *page)
+{
+	INIT_LIST_HEAD(&page->lru);
+	list_lru_add_node(&gpc->lru, &page->lru, page_to_nid(page));
+}
+#endif /* !CONFIG_HIGHMEM */
 /*
  * The testcases use internal knowledge of the implementation that shouldn't
  * be exposed to the rest of the kernel. Include these directly here.