[RFC,3/3] arm64: hibernate: idmap the single page that holds the copy page routines

Message ID 20200115143322.214247-4-james.morse@arm.com (mailing list archive)
State New, archived
Series arm64: hibernate: idmap the single page that holds the copy page routines

Commit Message

James Morse Jan. 15, 2020, 2:33 p.m. UTC
To resume from hibernate, the contents of memory are restored from
the swap image. This may overwrite any page, including the running
kernel and its page tables.

Hibernate copies the code it uses to do the restore into a single
page that it knows won't be overwritten, and maps it with page tables
built from pages that won't be overwritten.

Today the address it uses for this mapping is arbitrary, but to allow
kexec to reuse this code, it needs to be idmapped. To idmap the page
we must avoid the kernel helpers that have VA_BITS baked in.

Convert create_single_mapping() to take a single PA, and idmap it.
The page tables are built in the reverse order to normal using
pfn_pte() to stir in any bits between 52:48. T0SZ is always increased
to cover 48bits, or 52 if the copy code has bits 52:48 in its PA.
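
For reference, T0SZ is just 64 minus the number of address bits the TTBR0
range must cover; the TCR_T0SZ() macro used below only shifts that value
into place. A quick sketch of the two cases (not part of the patch):

	__cpu_set_tcr_t0sz(TCR_T0SZ(48));	/* 64 - 48 = 16: no PA bits above bit 47 */
	__cpu_set_tcr_t0sz(TCR_T0SZ(52));	/* 64 - 52 = 12: PA has bits in 52:48 */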

Signed-off-by: James Morse <james.morse@arm.com>
---
 arch/arm64/kernel/hibernate.c | 109 ++++++++++++++++------------------
 1 file changed, 50 insertions(+), 59 deletions(-)

Comments

Pasha Tatashin March 20, 2020, 9:22 p.m. UTC | #1
Hi James,

Sorry for the slow response.

Soon, I will send out an updated MMU-enabled kexec series which will have
this work included. I appreciate your help with this.

> Today the address it uses for this mapping is arbitrary, but to allow
> kexec to reuse this code, it needs to be idmapped. To idmap the page
> we must avoid the kernel helpers that have VA_BITS baked in.

Makes sense.

> Convert create_single_mapping() to take a single PA, and idmap it.

I like the idea of using idmap in both places!

> The page tables are built in the reverse order to normal using
> pfn_pte() to stir in any bits between 52:48. T0SZ is always increased
> to cover 48bits, or 52 if the copy code has bits 52:48 in its PA.

I do not think this will work for the kexec case. In hibernate we map only
one page, so we can allocate every level from bottom to top, but in
kexec we map many pages, so upper levels might already exist. I think
we will need to modify the loop to still go from top to bottom.
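
Roughly the kind of thing I mean: an untested sketch that reuses the
helpers from the code this patch removes (map_one_page() is a made-up
name, and error unwinding is omitted):

/* Map one page at va, allocating a level only when it is absent. */
static int map_one_page(pgd_t *trans_pgd, unsigned long va, phys_addr_t pa,
			void *(*allocator)(gfp_t mask), gfp_t mask)
{
	pgd_t *pgdp = pgd_offset_raw(trans_pgd, va);
	pud_t *pudp;
	pmd_t *pmdp;
	pte_t *ptep;

	if (pgd_none(READ_ONCE(*pgdp))) {
		pudp = allocator(mask);
		if (!pudp)
			return -ENOMEM;
		pgd_populate(&init_mm, pgdp, pudp);
	}

	pudp = pud_offset(pgdp, va);
	if (pud_none(READ_ONCE(*pudp))) {
		pmdp = allocator(mask);
		if (!pmdp)
			return -ENOMEM;
		pud_populate(&init_mm, pudp, pmdp);
	}

	pmdp = pmd_offset(pudp, va);
	if (pmd_none(READ_ONCE(*pmdp))) {
		ptep = allocator(mask);
		if (!ptep)
			return -ENOMEM;
		pmd_populate_kernel(&init_mm, pmdp, ptep);
	}

	ptep = pte_offset_kernel(pmdp, va);
	set_pte(ptep, pfn_pte(__phys_to_pfn(pa), PAGE_KERNEL_EXEC));

	return 0;
}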

Otherwise, this work makes sense. I will integrate it into my series.

Thank you,
Pasha
James Morse March 25, 2020, 9:58 a.m. UTC | #2
Hi Pavel,

On 3/20/20 9:22 PM, Pavel Tatashin wrote:
> Soon, I will send out an updated MMU-enabled kexec series which will have
> this work included. I appreciate your help with this.
> 
>> Today the address it uses for this mapping is arbitrary, but to allow
>> kexec to reuse this code, it needs to be idmapped. To idmap the page
>> we must avoid the kernel helpers that have VA_BITS baked in.
> 
> Makes sense.

>> Convert create_single_mapping() to take a single PA, and idmap it.
> 
> I like the idea of using idmap in both places!

This is the only way this should work. Both hibernate and kexec replace
all of memory, with the MMU on, while using a temporary set of page tables.

As much as possible of the code that does this should be shared.

Hibernate already does all of this, so kexec should re-use that code.


>> The page tables are built in the reverse order to normal using
>> pfn_pte() to stir in any bits between 52:48. T0SZ is always increased
>> to cover 48bits, or 52 if the copy code has bits 52:48 in its PA.
> 
> I do not think this will work for the kexec case. In hibernate we map only
> one page, so we can allocate every level from bottom to top, but in
> kexec we map many pages, so upper levels might already exist. I think
> we will need to modify the loop to still go from top to bottom.

No.

We should not have a second set of helpers for building page tables for
kexec; it's an unnecessary maintenance burden.


You keep coming back to this because you are trying to idmap all memory
on arm64. You do not need to do this.

You only need one page idmapped so you can switch TTBR1_EL1, and turn the
MMU off.


You can do the copy of memory using a copy of the linear map in
TTBR1_EL1. For an example: hibernate does exactly this.

This saves all the hassle of nomap and reserved-firmware pages, and the risk
of introducing mismatched attributes (which would lead to mysterious
coherency issues for the next kernel).

Your answer is going to be that kexec's data structures are physically
addressed. The linear map is linear: you can convert a kexec physical
address to a KASLR'd linear-map virtual address with a simple addition.
(Beware: the KASLR offset is _signed_, it can be negative!)
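
As a rough sketch of what I mean (assuming the usual linear-map
relationship; in the kernel you would simply use phys_to_virt()/__va(),
and kexec_pa_to_linear_va() below is a made-up name):

/* The linear map is a fixed, KASLR'd offset away: conversion is addition. */
static void *kexec_pa_to_linear_va(phys_addr_t pa)
{
	return (void *)(unsigned long)(pa - PHYS_OFFSET + PAGE_OFFSET);
}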


The code in this RFC was particularly tricky to test as its behaviour
depends on which bits of a pointer are set.

This code is complicated, and impossible to debug if it goes wrong.
(photograph of a screen with the word 'Bye' on it anyone?). Worse: it
must not introduce coherency issues into the next kernel.

It must be as simple as possible. What you are proposing is not.


Thanks,

James
Pasha Tatashin March 25, 2020, 1:29 p.m. UTC | #3
Hi James,

> You keep coming back to this because you are trying to idmap all memory
> on arm64. You do not need to do this.

No, this is not what I am trying to do. That approach was done in my
first RFC, but I have since abandoned it, and I now have a proper linear
copy configured in TTBR1:
See: https://lore.kernel.org/lkml/20191204155938.2279686-24-pasha.tatashin@soleen.com
+/*
+ * Map source segments starting from KEXEC_SRC_START, and map destination
+ * segments starting from KEXEC_DST_START, and return size of copy in
+ * *copy_len argument.
+ * Relocation function essentially needs to do:
+ * memcpy(KEXEC_DST_START, KEXEC_SRC_START, copy_len);
+ */

Sorry, I made a misleading comment that kexec needs to idmap many
pages; in fact, it actually needs to idmap only two pages with the
current approach:

1. relocation function
2. relocation function argument

I could fit both of them into a single page (the relocation function
body is tiny, and the argument only contains 9 fields, so 72 bytes),
though it will be a little ugly to set them up like that, so if you
have a better suggestion please let me know.

> You only need one page idmapped so you can switch TTBR1_EL1, and turn the
> MMU off.
>
>
> You can do the copy of memory using a copy of the linear map in
> TTBR1_EL1. For an example: hibernate does exactly this.

Yes, this is exactly what I am currently doing.

> The code in this RFC was particularly tricky to test as its behaviour
> depends on which bits of a pointer are set.
>
> This code is complicated, and impossible to debug if it goes wrong.
> (photograph of a screen with the word 'Bye' on it anyone?). Worse: it
> must not introduce coherency issues into the next kernel.
>
> It must be as simple as possible. What you are proposing is not.
>

I agree. So, let me modify kexec to idmap exactly one page (I will
stuff argument and body into a single page), and re-use it with
hibernate as you proposed.

Thank you,
Pasha
Pasha Tatashin March 25, 2020, 1:41 p.m. UTC | #4
On Wed, Mar 25, 2020 at 9:29 AM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Hi James,
>
> > You keep coming back to this because you are trying to idmap all memory
> > on arm64. You do not need to do this.
>
> No, this is not what I am trying to do. That approach was done in my
> first RFC, but I have since abandoned it, and I now have a proper linear
> copy configured in TTBR1:
> See: https://lore.kernel.org/lkml/20191204155938.2279686-24-pasha.tatashin@soleen.com
> +/*
> + * Map source segments starting from KEXEC_SRC_START, and map destination
> + * segments starting from KEXEC_DST_START, and return size of copy in
> + * *copy_len argument.
> + * Relocation function essentially needs to do:
> + * memcpy(KEXEC_DST_START, KEXEC_SRC_START, copy_len);
> + */
>
> Sorry, I made a misleading comment that kexec needs to idmap many
> pages; in fact, it actually needs to idmap only two pages with the
> current approach:
>
> 1. relocation function
> 2. relocation function argument
>
> I could fit both of them into a single page (the relocation function
> body is tiny, and the argument only contains 9 fields, so 72 bytes),
> though it will be a little ugly to set them up like that, so if you
> have a better suggestion please let me know.

Never mind. I figured we do not really need to idmap the argument. In
arm64_relocate_new_kernel(), while the MMU is off, we have plenty of
registers. I will simply load all the arguments into free registers
before turning the MMU on.

>
> > You only need one page idmapped so you can switch TTBR1_EL1, and turn the
> > MMU off.
> >
> >
> > You can do the copy of memory using a copy of the linear map in
> > TTBR1_EL1. For an example: hibernate does exactly this.
>
> Yes, this is exactly what I am currently doing.
>
> > The code in this RFC was particularly tricky to test as its behaviour
> > depends on which bits of a pointer are set.
> >
> > This code is complicated, and impossible to debug if it goes wrong.
> > (photograph of a screen with the word 'Bye' on it anyone?). Worse: it
> > must not introduce coherency issues into the next kernel.
> >
> > It must be as simple as possible. What you are proposing is not.
> >
>
> I agree. So, let me modify kexec to idmap exactly one page (I will
> stuff argument and body into a single page), and re-use it with
> hibernate as you proposed.
>
> Thank you,
> Pasha
James Morse March 25, 2020, 5:08 p.m. UTC | #5
Hi Pavel,

On 3/25/20 1:29 PM, Pavel Tatashin wrote:
>> You keep coming back to this because you are trying to idmap all memory
>> on arm64. You do not need to do this.
> 
> No, this is not what I am trying to do. That approach was done in my
> first RFC, but I have since abandoned it,

Ah, okay. These all merge into one!


>> You only need one page idmapped so you can switch TTBR1_EL1, and turn the
>> MMU off.
>>
>>
>> You can do the copy of memory using a copy of the linear map in
>> TTBR1_EL1. For an example: hibernate does exactly this.
> 
> Yes, this is exactly what I am currently doing.

Great!


Thanks,

James

Patch

diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 7f8cb7596f9e..b0bceec829c7 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -183,58 +183,57 @@  int arch_hibernation_header_restore(void *addr)
 EXPORT_SYMBOL(arch_hibernation_header_restore);
 
 /*
- * Create a set of page tables that map page to dst_addr.
+ * Create a set of page tables that idmap phys_dst_addr.
  */
-static int create_single_mapping(unsigned long page, unsigned long dst_addr,
-				 phys_addr_t *phys_dst_addr,
+static int create_single_mapping(phys_addr_t phys_dst_addr,
 				 void *(*allocator)(gfp_t mask),
 				 gfp_t mask)
 {
 	int rc = 0;
-	pgd_t *trans_pgd;
-	pgd_t *pgdp;
-	pud_t *pudp;
-	pmd_t *pmdp;
-	pte_t *ptep;
-
-	trans_pgd = allocator(mask);
-	if (!trans_pgd) {
-		rc = -ENOMEM;
-		goto out;
-	}
+	unsigned long level_mask;
+	int this_level = 3, index;
+	unsigned long *levels[4] = { };
+	unsigned long prev_level_entry;
+	int bits_mapped = PAGE_SHIFT - 4;
+	unsigned int level_lsb, level_msb, max_msb;
+	unsigned long pfn = __phys_to_pfn(phys_dst_addr);
+
+	if (phys_dst_addr & GENMASK(52, 48))
+		max_msb = 51;
+	else
+		max_msb = 47;
 
-	pgdp = pgd_offset_raw(trans_pgd, dst_addr);
-	if (pgd_none(READ_ONCE(*pgdp))) {
-		pudp = allocator(mask);
-		if (!pudp) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		pgd_populate(&init_mm, pgdp, pudp);
-	}
+	/*
+	 * The page we want to idmap may be outside the range covered by
+	 * VA_BITS that can be built using the kernel's p?d_populate() helpers.
+	 *
+	 * As a one off, for a single page, we build these page tables bottom
+	 * up and just assume that will need the maximum T0SZ.
+	 */
+	phys_dst_addr &= PAGE_MASK;
+	prev_level_entry = pte_val(pfn_pte(pfn, PAGE_KERNEL_EXEC));
 
-	pudp = pud_offset(pgdp, dst_addr);
-	if (pud_none(READ_ONCE(*pudp))) {
-		pmdp = allocator(mask);
-		if (!pmdp) {
+	for (this_level = 3; this_level >= 0; this_level--) {
+		levels[this_level] = allocator(mask);
+		if (!levels[this_level]) {
 			rc = -ENOMEM;
 			goto out;
 		}
-		pud_populate(&init_mm, pudp, pmdp);
-	}
 
-	pmdp = pmd_offset(pudp, dst_addr);
-	if (pmd_none(READ_ONCE(*pmdp))) {
-		ptep = allocator(mask);
-		if (!ptep) {
-			rc = -ENOMEM;
-			goto out;
-		}
-		pmd_populate_kernel(&init_mm, pmdp, ptep);
-	}
+		level_lsb = ARM64_HW_PGTABLE_LEVEL_SHIFT(this_level);
+		level_msb = min(level_lsb + bits_mapped, max_msb);
+		level_mask = GENMASK_ULL(level_msb, level_lsb);
 
-	ptep = pte_offset_kernel(pmdp, dst_addr);
-	set_pte(ptep, pfn_pte(virt_to_pfn(page), PAGE_KERNEL_EXEC));
+		index = (phys_dst_addr & level_mask) >> level_lsb;
+		*(levels[this_level] + index) = prev_level_entry;
+
+		pfn = virt_to_pfn(levels[this_level]);
+		prev_level_entry = pte_val(pfn_pte(pfn,
+						   __pgprot(PMD_TYPE_TABLE)));
+
+		if (level_msb == max_msb)
+			break;
+	}
 
 	/*
 	 * Load our new page tables. A strict BBM approach requires that we
@@ -245,24 +244,24 @@  static int create_single_mapping(unsigned long page, unsigned long dst_addr,
 	 * page, but TLBs may contain stale ASID-tagged entries (e.g. for EFI
 	 * runtime services), while for a userspace-driven test_resume cycle it
 	 * points to userspace page tables (and we must point it at a zero page
-	 * ourselves). Elsewhere we only (un)install the idmap with preemption
-	 * disabled, so T0SZ should be as required regardless.
+	 * ourselves).
+	 *
+	 * We change T0SZ as part of installing the idmap. This is undone by
+	 * cpu_uninstall_idmap() in __cpu_suspend_exit().
 	 */
 	cpu_set_reserved_ttbr0();
 	local_flush_tlb_all();
-	write_sysreg(phys_to_ttbr(virt_to_phys(pgdp)), ttbr0_el1);
+	__cpu_set_tcr_t0sz(TCR_T0SZ(max_msb + 1));
+	write_sysreg(phys_to_ttbr(__pfn_to_phys(pfn)), ttbr0_el1);
 	isb();
 
-	*phys_dst_addr = virt_to_phys((void *)page);
-
 out:
 	return rc;
 }
 
 /*
  * Copies length bytes, starting at src_start into an new page,
- * perform cache maintentance, then maps it at the specified address low
- * address as executable.
+ * perform cache maintentance, then idmaps it.
  *
  * This is used by hibernate to copy the code it needs to execute when
  * overwriting the kernel text. This function generates a new set of page
@@ -272,7 +271,6 @@  static int create_single_mapping(unsigned long page, unsigned long dst_addr,
  * page system.
  */
 static int create_safe_exec_page(void *src_start, size_t length,
-				 unsigned long dst_addr,
 				 phys_addr_t *phys_dst_addr,
 				 void *(*allocator)(gfp_t mask),
 				 gfp_t mask)
@@ -281,12 +279,12 @@  static int create_safe_exec_page(void *src_start, size_t length,
 
 	if (!page)
 		return -ENOMEM;
+	*phys_dst_addr = virt_to_phys((void *)page);
 
 	memcpy((void *)page, src_start, length);
 	__flush_icache_range(page, page + length);
 
-	return create_single_mapping(page, dst_addr, phys_dst_addr,
-				     allocator, gfp_t mask)
+	return create_single_mapping(*phys_dst_addr, allocator, mask);
 }
 
 #define dcache_clean_range(start, end)	__flush_dcache_area(start, (end - start))
@@ -499,7 +497,6 @@  int swsusp_arch_resume(void)
 	void *zero_page;
 	size_t exit_size;
 	pgd_t *tmp_pg_dir;
-	phys_addr_t phys_hibernate_exit;
 	void __noreturn (*hibernate_exit)(phys_addr_t, phys_addr_t, void *,
 					  void *, phys_addr_t, phys_addr_t);
 
@@ -529,19 +526,13 @@  int swsusp_arch_resume(void)
 		goto out;
 	}
 
-	/*
-	 * Locate the exit code in the bottom-but-one page, so that *NULL
-	 * still has disastrous affects.
-	 */
-	hibernate_exit = (void *)PAGE_SIZE;
 	exit_size = __hibernate_exit_text_end - __hibernate_exit_text_start;
 	/*
 	 * Copy swsusp_arch_suspend_exit() to a safe page. This will generate
 	 * a new set of ttbr0 page tables and load them.
 	 */
 	rc = create_safe_exec_page(__hibernate_exit_text_start, exit_size,
-				   (unsigned long)hibernate_exit,
-				   &phys_hibernate_exit,
+				   (phys_addr_t *)&hibernate_exit,
 				   (void *)get_safe_page, GFP_ATOMIC);
 	if (rc) {
 		pr_err("Failed to create safe executable page for hibernate_exit code.\n");
@@ -561,7 +552,7 @@  int swsusp_arch_resume(void)
 	 * We can skip this step if we booted at EL1, or are running with VHE.
 	 */
 	if (el2_reset_needed()) {
-		phys_addr_t el2_vectors = phys_hibernate_exit;  /* base */
+		phys_addr_t el2_vectors = (phys_addr_t)hibernate_exit;/* base */
 		el2_vectors += hibernate_el2_vectors -
 			       __hibernate_exit_text_start;     /* offset */