diff mbox

arm64: Boot failure on m400 with new cont PTEs

Message ID 20151123155132.GC32300@e104818-lin.cambridge.arm.com (mailing list archive)
State New, archived
Headers show

Commit Message

Catalin Marinas Nov. 23, 2015, 3:51 p.m. UTC
On Wed, Nov 18, 2015 at 04:29:32PM +0000, Mark Rutland wrote:
> FWIW, Will had a patch [1] for detecting PTE level break-before-make
> violations. I gave this a go on Juno with v4.4-rc1, and saw an issue in
> the EFI virtmap code that I'm currently investigating.
[...]
> [1] https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=aarch64/devel&id=372f39220ad35fa39a75419f2221ffeb6ffd78d3

I thought about adding a check for when we change from contiguous to
non-contiguous or vice-versa, on top of Will's patch:

--------------8<--------------------
--------------8<--------------------

Jeremy, can you give this fixup_init() patch a try, see whether it makes
any difference (it's just a temporary hack which may prevent us from
reverting the PTE_CONT patches until we have a better solution).

Anyway, I think we should merge Will's patch since it's a handy debug
tool.

Comments

Jeremy Linton Nov. 23, 2015, 4:02 p.m. UTC | #1
On 11/23/2015 09:51 AM, Catalin Marinas wrote:
> Call trace:
> [<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
> [<ffffffc0000954ec>] fixup_init+0x64/0x80
> [<ffffffc0000945a4>] free_initmem+0xc/0x38
> [<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
> [<ffffffc000085c50>] ret_from_fork+0x10/0x40
>
> What I don't get is why we have fixup_init() even when
> !CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
> a non-executable init section. However, the other sections are left
> executable when this config option is disabled. The patch below fixes
> the warnings above:

	Well the kernel permissions are a bit of a mess, and not at all 
"secure" in their current state. But I guess the thought must have been 
that turning off execute on the init sections was a good way to find 
functions incorrectly marked _init(). Which is different from RO. 
Frankly, I expect someone will push to cleanup the kernel permissions at 
some point (I've got it on my "spare time todo, list"), but this will of 
course make the create_mapping_late a lot more popular as I see it being 
called during module load/unload.
	Anyway, I've been saying the problem is create_mapping_late() all 
along, as you notice there isn't any TLB flushes in fixup_init() and 
that is the core of the problem, not all this other discussion.

> Jeremy, can you give this fixup_init() patch a try, see whether it makes
> any difference (it's just a temporary hack which may prevent us from
> reverting the PTE_CONT patches until we have a better solution).

I will try it, but i'm pretty sure that fixes it too.
Laura Abbott Nov. 23, 2015, 4:37 p.m. UTC | #2
On 11/23/15 8:02 AM, Jeremy Linton wrote:
> On 11/23/2015 09:51 AM, Catalin Marinas wrote:
>> Call trace:
>> [<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
>> [<ffffffc0000954ec>] fixup_init+0x64/0x80
>> [<ffffffc0000945a4>] free_initmem+0xc/0x38
>> [<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
>> [<ffffffc000085c50>] ret_from_fork+0x10/0x40
>>
>> What I don't get is why we have fixup_init() even when
>> !CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
>> a non-executable init section. However, the other sections are left
>> executable when this config option is disabled. The patch below fixes
>> the warnings above:
>
>      Well the kernel permissions are a bit of a mess, and not at all
> "secure" in their current state. But I guess the thought must have been
> that turning off execute on the init sections was a good way to find
> functions incorrectly marked _init(). Which is different from RO.
> Frankly, I expect someone will push to cleanup the kernel permissions at
> some point (I've got it on my "spare time todo, list"), but this will of
> course make the create_mapping_late a lot more popular as I see it being
> called during module load/unload.
>      Anyway, I've been saying the problem is create_mapping_late() all
> along, as you notice there isn't any TLB flushes in fixup_init() and
> that is the core of the problem, not all this other discussion.
>

fixup_init was deliberately designed to change the sections even if
DEBUG_RODATA was not enabled. This was partially designed to match the
behavior of arm(32) which also drops the nx bit but also good practice
in general for security.

Which permissions still need to be cleaned up from your perspective?

Thanks,
Laura
Jeremy Linton Nov. 23, 2015, 4:42 p.m. UTC | #3
On 11/23/2015 10:37 AM, Laura Abbott wrote:
> Which permissions still need to be cleaned up from your perspective?

IMHO, the vast majority of the linear map should not be 
executable/read/write, which is what happens today.
Laura Abbott Nov. 23, 2015, 5:52 p.m. UTC | #4
On 11/23/15 8:42 AM, Jeremy Linton wrote:
> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>> Which permissions still need to be cleaned up from your perspective?
>
> IMHO, the vast majority of the linear map should not be executable/read/write, which is what happens today.
>

With CONFIG_DEBUG_RODATA on I'm not seeing any rwx regions

# cat /sys/kernel/debug/kernel_page_tables
---[ vmalloc() Area ]---
0xffffff8000000000-0xffffff8000010000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000012000-0xffffff8000013000           4K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000018000-0xffffff8000019000           4K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000020000-0xffffff8000030000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000031000-0xffffff8000071000         256K     RW NX SHD AF            UXN MEM/NORMAL-NC
0xffffff8000080000-0xffffff8000180000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff80001ae000-0xffffff80001b1000          12K     RW NX SHD AF            UXN MEM/NORMAL
0xffffff8000200000-0xffffff8000300000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000380000-0xffffff8000480000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000500000-0xffffff8000600000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000680000-0xffffff8000780000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000800000-0xffffff8000900000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000980000-0xffffff8000a80000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000b00000-0xffffff8000c00000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000c80000-0xffffff8000d80000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000e00000-0xffffff8000f00000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8000f80000-0xffffff8001080000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001100000-0xffffff8001200000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001280000-0xffffff8001380000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001400000-0xffffff8001500000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001580000-0xffffff8001680000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
0xffffff8001700000-0xffffff8001800000           1M     RW NX SHD AF            UXN DEVICE/nGnRE
---[ vmalloc() End ]---
---[ vmemmap start ]---
0xffffffbdc1000000-0xffffffbdc3000000          32M     RW NX SHD AF        BLK UXN MEM/NORMAL
---[ vmemmap end ]---
---[ Fixmap start ]---
0xffffffbffa800000-0xffffffbffaa00000           2M     ro NX SHD AF        BLK UXN MEM/NORMAL
---[ Fixmap end ]---
---[ PCI I/O start ]---
0xffffffbffae00000-0xffffffbffae10000          64K     RW NX SHD AF            UXN DEVICE/nGnRE
---[ PCI I/O end ]---
---[ Modules start ]---
---[ Modules end ]---
---[ Kernel Mapping ]---
0xffffffc000000000-0xffffffc000080000         512K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000080000-0xffffffc000082000           8K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000082000-0xffffffc000090000          56K     ro x  SHD AF            UXN MEM/NORMAL
0xffffffc000090000-0xffffffc000200000        1472K     ro x  SHD AF    CON     UXN MEM/NORMAL
0xffffffc000200000-0xffffffc000800000           6M     ro x  SHD AF        BLK UXN MEM/NORMAL
0xffffffc000800000-0xffffffc000980000        1536K     ro x  SHD AF    CON     UXN MEM/NORMAL
0xffffffc000980000-0xffffffc000987000          28K     ro x  SHD AF            UXN MEM/NORMAL
0xffffffc000987000-0xffffffc000990000          36K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000990000-0xffffffc000a00000         448K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000a00000-0xffffffc000a10000          64K     RW NX SHD AF            UXN MEM/NORMAL
0xffffffc000a10000-0xffffffc000c00000        1984K     RW NX SHD AF    CON     UXN MEM/NORMAL
0xffffffc000c00000-0xffffffc040000000        1012M     RW NX SHD AF        BLK UXN MEM/NORMAL
0xffffffc040000000-0xffffffc080000000           1G     RW NX SHD AF        BLK UXN MEM/NORMAL

are there other parts of the map you're seeing that aren't on my system?

Thanks,
Laura
Jeremy Linton Nov. 23, 2015, 6:46 p.m. UTC | #5
On 11/23/2015 11:52 AM, Laura Abbott wrote:
> On 11/23/15 8:42 AM, Jeremy Linton wrote:
>> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>>> Which permissions still need to be cleaned up from your perspective?
>>
>> IMHO, the vast majority of the linear map should not be
>> executable/read/write, which is what happens today.
>>
>
> With CONFIG_DEBUG_RODATA on I'm not seeing any rwx regions

> 0xffffffc000000000-0xffffffc000080000         512K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000080000-0xffffffc000082000           8K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000082000-0xffffffc000090000          56K     ro x  SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000090000-0xffffffc000200000        1472K     ro x  SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000200000-0xffffffc000800000           6M     ro x  SHD
> AF        BLK UXN MEM/NORMAL
> 0xffffffc000800000-0xffffffc000980000        1536K     ro x  SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000980000-0xffffffc000987000          28K     ro x  SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000987000-0xffffffc000990000          36K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000990000-0xffffffc000a00000         448K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000a00000-0xffffffc000a10000          64K     RW NX SHD
> AF            UXN MEM/NORMAL
> 0xffffffc000a10000-0xffffffc000c00000        1984K     RW NX SHD AF
> CON     UXN MEM/NORMAL
> 0xffffffc000c00000-0xffffffc040000000        1012M     RW NX SHD
> AF        BLK UXN MEM/NORMAL
> 0xffffffc040000000-0xffffffc080000000           1G     RW NX SHD
> AF        BLK UXN MEM/NORMAL
>
> are there other parts of the map you're seeing that aren't on my system?


Thats my mistake, I didn't remember seeing the NX set for the block 
mappings with the RO case, the machine I just checked looks good too. 
So, I you can ignore that little rant.
Ard Biesheuvel Nov. 24, 2015, 8:04 a.m. UTC | #6
On 23 November 2015 at 17:42, Jeremy Linton <jlinton@redhat.com> wrote:
> On 11/23/2015 10:37 AM, Laura Abbott wrote:
>>
>> Which permissions still need to be cleaned up from your perspective?
>
>
> IMHO, the vast majority of the linear map should not be
> executable/read/write, which is what happens today.
>

IMHO, this should not be hidden under DEBUG_RODATA (which some people
may mistake for a debug option, while others may assume it affects
rodata only) since the kernel stacks are executable without it.
diff mbox

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index defbfde43a43..70e02e3b1a78 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -240,6 +240,10 @@  static inline void set_pte(pte_t *ptep, pte_t pte)
 	if (WARN_ON((old & PTE_ATTRINDX_MASK) != (new & PTE_ATTRINDX_MASK)))
 		goto pte_bad;
 
+	/* Changing contiguity may lead to a TLB conflict */
+	if (WARN_ON((old & PTE_CONT) != (new & PTE_CONT)))
+		goto pte_bad;
+
 	/* Change of OA is only an issue if one mapping is writable */
 	if (!(old & new & PTE_RDONLY) &&
 	    WARN_ON(pte_pfn(*ptep) != pte_pfn(pte)))
--------------8<--------------------

But it doesn't look nice afterwards. It's the fixup_init() trying to
re-write the entries and we start changing the PTE_CONT bit:

Call trace:
[<ffffffc0000952b8>] __create_mapping.isra.5+0x360/0x530
[<ffffffc0000954ec>] fixup_init+0x64/0x80
[<ffffffc0000945a4>] free_initmem+0xc/0x38
[<ffffffc0005eb9f8>] kernel_init+0x20/0xe0
[<ffffffc000085c50>] ret_from_fork+0x10/0x40

What I don't get is why we have fixup_init() even when
!CONFIG_DEBUG_RODATA. It probably breaks the initial mapping just to get
a non-executable init section. However, the other sections are left
executable when this config option is disabled. The patch below fixes
the warnings above:

--------------8<--------------------
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index abb66f84d4ac..e0f82e1a1c09 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -482,9 +482,11 @@  void mark_rodata_ro(void)
 
 void fixup_init(void)
 {
+#ifdef CONFIG_DEBUG_RODATA
 	create_mapping_late(__pa(__init_begin), (unsigned long)__init_begin,
 			(unsigned long)__init_end - (unsigned long)__init_begin,
 			PAGE_KERNEL);
+#endif
 }
 
 /*