From patchwork Thu Mar 16 13:15:19 2017
X-Patchwork-Submitter: Paolo Bonzini
X-Patchwork-Id: 9628111
X-Patchwork-Delegate: herbert@gondor.apana.org.au
Subject: Re: [RFC PATCH v2 14/32] x86: mm: Provide support to use memblock
 when spliting large pages
To: Brijesh Singh, Borislav Petkov
References: <148846752022.2349.13667498174822419498.stgit@brijesh-build-machine>
 <148846771545.2349.9373586041426414252.stgit@brijesh-build-machine>
 <20170310110657.hophlog2juw5hpzz@pd.tnic>
Cc: simon.guinot@sequanux.org, linux-efi@vger.kernel.org, kvm@vger.kernel.org,
 rkrcmar@redhat.com, matt@codeblueprint.co.uk, linux-pci@vger.kernel.org,
 linus.walleij@linaro.org, gary.hook@amd.com, linux-mm@kvack.org,
 paul.gortmaker@windriver.com, hpa@zytor.com, cl@linux.com,
 dan.j.williams@intel.com, aarcange@redhat.com, sfr@canb.auug.org.au,
 andriy.shevchenko@linux.intel.com, herbert@gondor.apana.org.au,
 bhe@redhat.com, xemul@parallels.com, joro@8bytes.org, x86@kernel.org,
 peterz@infradead.org, piotr.luc@intel.com, mingo@redhat.com,
 msalter@redhat.com, ross.zwisler@linux.intel.com, dyoung@redhat.com,
 thomas.lendacky@amd.com, jroedel@suse.de, keescook@chromium.org,
 arnd@arndb.de, toshi.kani@hpe.com, mathieu.desnoyers@efficios.com,
 luto@kernel.org, devel@linuxdriverproject.org, bhelgaas@google.com,
 tglx@linutronix.de, mchehab@kernel.org, iamjoonsoo.kim@lge.com,
 labbott@fedoraproject.org, tony.luck@intel.com, alexandre.bounine@idt.com,
 kuleshovmail@gmail.com, linux-kernel@vger.kernel.org, mcgrof@kernel.org,
 mst@redhat.com, linux-crypto@vger.kernel.org, tj@kernel.org,
 akpm@linux-foundation.org, davem@davemloft.net
From: Paolo Bonzini
Date: Thu, 16 Mar 2017 14:15:19 +0100

On 10/03/2017 23:41, Brijesh Singh wrote:
>> Maybe there's a reason this fires:
>>
>> WARNING: modpost: Found 2 section mismatch(es).
>> To see full details build your kernel with:
>> 'make CONFIG_DEBUG_SECTION_MISMATCH=y'
>>
>> WARNING: vmlinux.o(.text+0x48edc): Section mismatch in reference from
>> the function __change_page_attr() to the function
>> .init.text:memblock_alloc()
>> The function __change_page_attr() references
>> the function __init memblock_alloc().
>> This is often because __change_page_attr lacks a __init
>> annotation or the annotation of memblock_alloc is wrong.
>>
>> WARNING: vmlinux.o(.text+0x491d1): Section mismatch in reference from
>> the function __change_page_attr() to the function
>> .meminit.text:memblock_free()
>> The function __change_page_attr() references
>> the function __meminit memblock_free().
>> This is often because __change_page_attr lacks a __meminit
>> annotation or the annotation of memblock_free is wrong.
>>
>> But maybe Paolo might have an even better idea...
>
> I am sure he will have better idea :)

Not sure if it's better or worse, but an alternative idea is to turn
__change_page_attr and __change_page_attr_set_clr inside out, so that:

1) the alloc_pages/__free_page calls happen in __change_page_attr_set_clr;

2) __change_page_attr_set_clr overall does not become more complex.

Then you can introduce __early_change_page_attr_set_clr and/or
early_kernel_map_pages_in_pgd for use in your next patches; they would use
the memblock allocator instead of alloc_pages/__free_page.

The attached patch is compile-tested only and almost certainly has some
thinko in it.  But it even trims a few lines from the code, so the idea
might have some merit.
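As for the early variant mentioned above: with the loop reorganized this
way it could be little more than the same code with the buddy allocation
swapped for memblock.  A rough, untested sketch of the shape I have in
mind follows; the function name is the one suggested above, but the exact
memblock calls, the phys-to-struct-page conversion and the single-range
simplification are only placeholders, not part of the attached patch:

/*
 * Hypothetical early counterpart of __change_page_attr_set_clr(): the
 * page-table page for a split comes from memblock instead of
 * alloc_page().  Marking it __init should also sidestep the modpost
 * section mismatch quoted above, since memblock_alloc() is __init.
 */
static int __init __early_change_page_attr_set_clr(struct cpa_data *cpa,
                                                   int checkalias)
{
        unsigned long vaddr = cpa_address(cpa, cpa->numpages);
        phys_addr_t pa;
        pte_t *kpte;
        int ret;

        if (IS_ERR_VALUE(vaddr))
                return 0;
repeat:
        ret = __change_page_attr(cpa, &kpte, vaddr, checkalias);
        if (ret <= 0)
                return ret;     /* done, or an error */

        /* ret > 0: the large page must be split; feed it a memblock page */
        pa = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
        if (!pa)
                return -ENOMEM;
        if (__split_large_page(cpa, kpte, vaddr, pfn_to_page(pa >> PAGE_SHIFT)))
                memblock_free(pa, PAGE_SIZE);
        goto repeat;
}

(Only one range and no cpa_advance()/numpages bookkeeping, just to keep
the idea visible; early_kernel_map_pages_in_pgd could presumably be a thin
wrapper that fills in a cpa_data and calls this.)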
Paolo

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 28d42130243c..953c8e697562 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -490,11 +490,12 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
 }
 
 static int
-try_preserve_large_page(pte_t *kpte, unsigned long address,
+try_preserve_large_page(pte_t **p_kpte, unsigned long address,
                         struct cpa_data *cpa)
 {
         unsigned long nextpage_addr, numpages, pmask, psize, addr, pfn, old_pfn;
-        pte_t new_pte, old_pte, *tmp;
+        pte_t *kpte = *p_kpte;
+        pte_t new_pte, old_pte;
         pgprot_t old_prot, new_prot, req_prot;
         int i, do_split = 1;
         enum pg_level level;
@@ -507,8 +508,8 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
          * Check for races, another CPU might have split this page
          * up already:
          */
-        tmp = _lookup_address_cpa(cpa, address, &level);
-        if (tmp != kpte)
+        *p_kpte = _lookup_address_cpa(cpa, address, &level);
+        if (*p_kpte != kpte)
                 goto out_unlock;
 
         switch (level) {
@@ -634,17 +635,18 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
         unsigned int i, level;
         pte_t *tmp;
         pgprot_t ref_prot;
+        int retry = 1;
 
+        if (!debug_pagealloc_enabled())
+                spin_lock(&cpa_lock);
         spin_lock(&pgd_lock);
         /*
          * Check for races, another CPU might have split this page
          * up for us already:
          */
         tmp = _lookup_address_cpa(cpa, address, &level);
-        if (tmp != kpte) {
-                spin_unlock(&pgd_lock);
-                return 1;
-        }
+        if (tmp != kpte)
+                goto out;
 
         paravirt_alloc_pte(&init_mm, page_to_pfn(base));
 
@@ -671,10 +673,11 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
                 break;
 
         default:
-                spin_unlock(&pgd_lock);
-                return 1;
+                goto out;
         }
 
+        retry = 0;
+
         /*
          * Set the GLOBAL flags only if the PRESENT flag is set
          * otherwise pmd/pte_present will return true even on a non
@@ -718,28 +721,34 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
          * going on.
          */
         __flush_tlb_all();
-        spin_unlock(&pgd_lock);
 
-        return 0;
-}
-
-static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
-                            unsigned long address)
-{
-        struct page *base;
+out:
+        spin_unlock(&pgd_lock);
+        /*
+         * Do a global flush tlb after splitting the large page
+         * and before we do the actual change page attribute in the PTE.
+         *
+         * With out this, we violate the TLB application note, that says
+         * "The TLBs may contain both ordinary and large-page
+         *  translations for a 4-KByte range of linear addresses. This
+         *  may occur if software modifies the paging structures so that
+         *  the page size used for the address range changes. If the two
+         *  translations differ with respect to page frame or attributes
+         *  (e.g., permissions), processor behavior is undefined and may
+         *  be implementation-specific."
+         *
+         * We do this global tlb flush inside the cpa_lock, so that we
+         * don't allow any other cpu, with stale tlb entries change the
+         * page attribute in parallel, that also falls into the
+         * just split large page entry.
+         */
+        if (!retry)
+                flush_tlb_all();
 
         if (!debug_pagealloc_enabled())
                 spin_unlock(&cpa_lock);
-        base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
-        if (!debug_pagealloc_enabled())
-                spin_lock(&cpa_lock);
-        if (!base)
-                return -ENOMEM;
-
-        if (__split_large_page(cpa, kpte, address, base))
-                __free_page(base);
 
-        return 0;
+        return retry;
 }
 
 static bool try_to_free_pte_page(pte_t *pte)
@@ -1166,30 +1175,26 @@ static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
         }
 }
 
-static int __change_page_attr(struct cpa_data *cpa, int primary)
+static int __change_page_attr(struct cpa_data *cpa, pte_t **p_kpte, unsigned long address,
+                              int primary)
 {
-        unsigned long address;
-        int do_split, err;
         unsigned int level;
         pte_t *kpte, old_pte;
+        int err = 0;
 
-        if (cpa->flags & CPA_PAGES_ARRAY) {
-                struct page *page = cpa->pages[cpa->curpage];
-                if (unlikely(PageHighMem(page)))
-                        return 0;
-                address = (unsigned long)page_address(page);
-        } else if (cpa->flags & CPA_ARRAY)
-                address = cpa->vaddr[cpa->curpage];
-        else
-                address = *cpa->vaddr;
-repeat:
-        kpte = _lookup_address_cpa(cpa, address, &level);
-        if (!kpte)
-                return __cpa_process_fault(cpa, address, primary);
+        if (!debug_pagealloc_enabled())
+                spin_lock(&cpa_lock);
+        *p_kpte = kpte = _lookup_address_cpa(cpa, address, &level);
+        if (!kpte) {
+                err = __cpa_process_fault(cpa, address, primary);
+                goto out;
+        }
 
         old_pte = *kpte;
-        if (pte_none(old_pte))
-                return __cpa_process_fault(cpa, address, primary);
+        if (pte_none(old_pte)) {
+                err = __cpa_process_fault(cpa, address, primary);
+                goto out;
+        }
 
         if (level == PG_LEVEL_4K) {
                 pte_t new_pte;
@@ -1228,59 +1233,27 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
                         cpa->flags |= CPA_FLUSHTLB;
                 }
                 cpa->numpages = 1;
-                return 0;
+                goto out;
         }
 
         /*
          * Check, whether we can keep the large page intact
          * and just change the pte:
          */
-        do_split = try_preserve_large_page(kpte, address, cpa);
-        /*
-         * When the range fits into the existing large page,
-         * return. cp->numpages and cpa->tlbflush have been updated in
-         * try_large_page:
-         */
-        if (do_split <= 0)
-                return do_split;
-
-        /*
-         * We have to split the large page:
-         */
-        err = split_large_page(cpa, kpte, address);
-        if (!err) {
-                /*
-                 * Do a global flush tlb after splitting the large page
-                 * and before we do the actual change page attribute in the PTE.
-                 *
-                 * With out this, we violate the TLB application note, that says
-                 * "The TLBs may contain both ordinary and large-page
-                 *  translations for a 4-KByte range of linear addresses. This
-                 *  may occur if software modifies the paging structures so that
-                 *  the page size used for the address range changes. If the two
-                 *  translations differ with respect to page frame or attributes
-                 *  (e.g., permissions), processor behavior is undefined and may
-                 *  be implementation-specific."
-                 *
-                 * We do this global tlb flush inside the cpa_lock, so that we
-                 * don't allow any other cpu, with stale tlb entries change the
-                 * page attribute in parallel, that also falls into the
-                 * just split large page entry.
-                 */
-                flush_tlb_all();
-                goto repeat;
-        }
+        err = try_preserve_large_page(p_kpte, address, cpa);
 
+out:
+        if (!debug_pagealloc_enabled())
+                spin_unlock(&cpa_lock);
         return err;
 }
 
 static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias);
 
-static int cpa_process_alias(struct cpa_data *cpa)
+static int cpa_process_alias(struct cpa_data *cpa, unsigned long vaddr)
 {
         struct cpa_data alias_cpa;
         unsigned long laddr = (unsigned long)__va(cpa->pfn << PAGE_SHIFT);
-        unsigned long vaddr;
         int ret;
 
         if (!pfn_range_is_mapped(cpa->pfn, cpa->pfn + 1))
@@ -1290,16 +1263,6 @@ static int cpa_process_alias(struct cpa_data *cpa)
          * No need to redo, when the primary call touched the direct
          * mapping already:
          */
-        if (cpa->flags & CPA_PAGES_ARRAY) {
-                struct page *page = cpa->pages[cpa->curpage];
-                if (unlikely(PageHighMem(page)))
-                        return 0;
-                vaddr = (unsigned long)page_address(page);
-        } else if (cpa->flags & CPA_ARRAY)
-                vaddr = cpa->vaddr[cpa->curpage];
-        else
-                vaddr = *cpa->vaddr;
-
         if (!(within(vaddr, PAGE_OFFSET,
                     PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)))) {
 
@@ -1338,33 +1301,64 @@ static int cpa_process_alias(struct cpa_data *cpa)
         return 0;
 }
 
+static unsigned long cpa_address(struct cpa_data *cpa, unsigned long numpages)
+{
+        /*
+         * Store the remaining nr of pages for the large page
+         * preservation check.
+         */
+        /* for array changes, we can't use large page */
+        cpa->numpages = 1;
+        if (cpa->flags & CPA_PAGES_ARRAY) {
+                struct page *page = cpa->pages[cpa->curpage];
+                if (unlikely(PageHighMem(page)))
+                        return -EINVAL;
+                return (unsigned long)page_address(page);
+        } else if (cpa->flags & CPA_ARRAY) {
+                return cpa->vaddr[cpa->curpage];
+        } else {
+                cpa->numpages = numpages;
+                return *cpa->vaddr;
+        }
+}
+
+static void cpa_advance(struct cpa_data *cpa)
+{
+        if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY))
+                cpa->curpage++;
+        else
+                *cpa->vaddr += cpa->numpages * PAGE_SIZE;
+}
+
 static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
 {
         unsigned long numpages = cpa->numpages;
+        unsigned long vaddr;
+        struct page *base;
+        pte_t *kpte;
         int ret;
 
         while (numpages) {
-                /*
-                 * Store the remaining nr of pages for the large page
-                 * preservation check.
-                 */
-                cpa->numpages = numpages;
-                /* for array changes, we can't use large page */
-                if (cpa->flags & (CPA_ARRAY | CPA_PAGES_ARRAY))
-                        cpa->numpages = 1;
-
-                if (!debug_pagealloc_enabled())
-                        spin_lock(&cpa_lock);
-                ret = __change_page_attr(cpa, checkalias);
-                if (!debug_pagealloc_enabled())
-                        spin_unlock(&cpa_lock);
-                if (ret)
-                        return ret;
-
-                if (checkalias) {
-                        ret = cpa_process_alias(cpa);
-                        if (ret)
+                vaddr = cpa_address(cpa, numpages);
+                if (!IS_ERR_VALUE(vaddr)) {
+repeat:
+                        ret = __change_page_attr(cpa, &kpte, vaddr, checkalias);
+                        if (ret < 0)
                                 return ret;
+                        if (ret) {
+                                base = alloc_page(GFP_KERNEL|__GFP_NOTRACK);
+                                if (!base)
+                                        return -ENOMEM;
+                                if (__split_large_page(cpa, kpte, vaddr, base))
+                                        __free_page(base);
+                                goto repeat;
+                        }
+
+                        if (checkalias) {
+                                ret = cpa_process_alias(cpa, vaddr);
+                                if (ret < 0)
+                                        return ret;
+                        }
                 }
 
                 /*
@@ -1374,11 +1368,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int checkalias)
                  */
                 BUG_ON(cpa->numpages > numpages || !cpa->numpages);
                 numpages -= cpa->numpages;
-                if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY))
-                        cpa->curpage++;
-                else
-                        *cpa->vaddr += cpa->numpages * PAGE_SIZE;
-
+                cpa_advance(cpa);
         }
         return 0;
 }