target/i386: Fix physical address truncation when PAE is enabled

Message ID 0102018c7d0026c1-fd5f0b50-48fd-4552-be0a-cbb6070b5e14-000000@eu-west-1.amazonses.com (mailing list archive)
State New, archived
Series target/i386: Fix physical address truncation when PAE is enabled

Commit Message

Michael Brown Dec. 18, 2023, 12:56 p.m. UTC
The address translation logic in get_physical_address() will currently
truncate physical addresses to 32 bits unless long mode is enabled.
This is incorrect when using physical address extensions (PAE) outside
of long mode, with the result that a 32-bit operating system using PAE
to access memory above 4G will experience undefined behaviour.

The truncation code was originally introduced in commit 33dfdb5 ("x86:
only allow real mode to access 32bit without LMA"), where it applied
only to translations performed while paging is disabled (and so cannot
affect guests using PAE).

Commit 9828198 ("target/i386: Add MMU_PHYS_IDX and MMU_NESTED_IDX")
rearranged the code such that the truncation also applied to the use
of MMU_PHYS_IDX and MMU_NESTED_IDX.  Commit 4a1e9d4 ("target/i386: Use
atomic operations for pte updates") brought this truncation into scope
for page table entry accesses, and is the first commit for which a
Windows 10 32-bit guest will reliably fail to boot if memory above 4G
is present.

Fix by testing for PAE being enabled via the relevant bit in CR4,
instead of testing for long mode being enabled.  PAE must be enabled
as a prerequisite of long mode, and so this is a generalisation of the
current test.

Remove the #ifdef TARGET_X86_64 check since PAE exists in both 32-bit
and 64-bit processors, and both should exhibit the same truncation
behaviour when PAE is disabled.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2040
Signed-off-by: Michael Brown <mcb30@ipxe.org>
---
 target/i386/tcg/sysemu/excp_helper.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Comments

Richard Henderson Dec. 20, 2023, 4:22 a.m. UTC | #1
On 12/18/23 23:56, Michael Brown wrote:
> The address translation logic in get_physical_address() will currently
> truncate physical addresses to 32 bits unless long mode is enabled.
> This is incorrect when using physical address extensions (PAE) outside
> of long mode, with the result that a 32-bit operating system using PAE
> to access memory above 4G will experience undefined behaviour.
> 
> The truncation code was originally introduced in commit 33dfdb5 ("x86:
> only allow real mode to access 32bit without LMA"), where it applied
> only to translations performed while paging is disabled (and so cannot
> affect guests using PAE).
> 
> Commit 9828198 ("target/i386: Add MMU_PHYS_IDX and MMU_NESTED_IDX")
> rearranged the code such that the truncation also applied to the use
> of MMU_PHYS_IDX and MMU_NESTED_IDX.  Commit 4a1e9d4 ("target/i386: Use
> atomic operations for pte updates") brought this truncation into scope
> for page table entry accesses, and is the first commit for which a
> Windows 10 32-bit guest will reliably fail to boot if memory above 4G
> is present.
> 
> Fix by testing for PAE being enabled via the relevant bit in CR4,
> instead of testing for long mode being enabled.  PAE must be enabled
> as a prerequisite of long mode, and so this is a generalisation of the
> current test.
> 
> Remove the #ifdef TARGET_X86_64 check since PAE exists in both 32-bit
> and 64-bit processors, and both should exhibit the same truncation
> behaviour when PAE is disabled.
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2040
> Signed-off-by: Michael Brown <mcb30@ipxe.org>
> ---
>   target/i386/tcg/sysemu/excp_helper.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
> index 5b86f439ad..3d0d0d78d7 100644
> --- a/target/i386/tcg/sysemu/excp_helper.c
> +++ b/target/i386/tcg/sysemu/excp_helper.c
> @@ -582,12 +582,10 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>   
>       /* Translation disabled. */
>       out->paddr = addr & x86_get_a20_mask(env);
> -#ifdef TARGET_X86_64
> -    if (!(env->hflags & HF_LMA_MASK)) {
> -        /* Without long mode we can only address 32bits in real mode */
> +    if (!(env->cr[4] & CR4_PAE_MASK)) {
> +        /* Without PAE we can address only 32 bits */
>           out->paddr = (uint32_t)out->paddr;
>       }
> -#endif

This is not the correct refactoring.

I agree that what we're currently doing is wrong, especially for MMU_PHYS_IDX, but for the 
default case, if CR0.PG == 0, then CR4.PAE is ignored (SDM vol. 3, section 4.1.1).

I suspect the correct fix is to have MMU_PHYS_IDX pass through the input address 
unchanged, and it is the responsibility of the higher level paging mmu_idx to truncate 
physical addresses per PG_MODE_* before recursing.
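
Something along these lines, perhaps (a completely untested sketch -- the names
and placement are illustrative only, not a finished patch):

    case MMU_PHYS_IDX:
        /* Physical access: pass the input address through unchanged
           (apart from the A20 mask), with no 32-bit truncation at all. */
        out->paddr = addr & x86_get_a20_mask(env);
        out->prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
        out->page_size = TARGET_PAGE_SIZE;
        return true;

with the paging walk then masking physical addresses itself before recursing
for the PTE access, e.g.

    if (!(pg_mode & PG_MODE_PAE)) {
        /* 32-bit non-PAE paging: PTE addresses are at most 32 bits. */
        pte_addr = (uint32_t)pte_addr;
    }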


r~


>       out->prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
>       out->page_size = TARGET_PAGE_SIZE;
>       return true;
Michael Brown Dec. 20, 2023, 11:03 a.m. UTC | #2
On 20/12/2023 04:22, Richard Henderson wrote:
> On 12/18/23 23:56, Michael Brown wrote:
>> The address translation logic in get_physical_address() will currently
>> truncate physical addresses to 32 bits unless long mode is enabled.
>> This is incorrect when using physical address extensions (PAE) outside
>> of long mode, with the result that a 32-bit operating system using PAE
>> to access memory above 4G will experience undefined behaviour.
>>
>> <snip>
>>
>> --- a/target/i386/tcg/sysemu/excp_helper.c
>> +++ b/target/i386/tcg/sysemu/excp_helper.c
>> @@ -582,12 +582,10 @@ static bool get_physical_address(CPUX86State *env, vaddr addr,
>>       /* Translation disabled. */
>>       out->paddr = addr & x86_get_a20_mask(env);
>> -#ifdef TARGET_X86_64
>> -    if (!(env->hflags & HF_LMA_MASK)) {
>> -        /* Without long mode we can only address 32bits in real mode */
>> +    if (!(env->cr[4] & CR4_PAE_MASK)) {
>> +        /* Without PAE we can address only 32 bits */
>>           out->paddr = (uint32_t)out->paddr;
>>       }
>> -#endif
> 
> This is not the correct refactoring.
> 
> I agree that what we're currently doing is wrong, esp for MMU_PHYS_IDX, 
> but for the default case, if CR0.PG == 0, then CR4.PAE is ignored (vol 
> 3, section 4.1.1).
> 
> I suspect the correct fix is to have MMU_PHYS_IDX pass through the input 
> address unchanged, and it is the responsibility of the higher level 
> paging mmu_idx to truncate physical addresses per PG_MODE_* before 
> recursing.

Thank you for reviewing, and for confirming the bug.

For the MMU_PHYS_IDX case, I agree that it makes sense for the address 
to be passed through unchanged.

For the default case, I think it would make sense to unconditionally 
truncate the address to 32 bits if paging is disabled.  (I am not sure 
why the original commit 33dfdb5 included a test for long mode, since I 
do not see how it is possible to get the CPU into long mode with paging 
disabled.)

I do not know what ought to be done in the MMU_NESTED_IDX case, and 
would appreciate your input on this.

Thanks,

Michael
Richard Henderson Dec. 20, 2023, 9:51 p.m. UTC | #3
On 12/20/23 22:03, Michael Brown wrote:
> For the default case, I think it would make sense to unconditionally truncate the address 
> to 32 bits if paging is disabled.  (I am not sure why the original commit 33dfdb5 included 
> a test for long mode, since I do not see how it is possible to get the CPU into long mode 
> with paging disabled.)

You are correct that paging is mandatory for LMA -- indeed, setting CR0.PG is the final 
step in 10.8.5 Initializing IA-32e Mode, which copies EFER.LME to EFER.LMA.
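
(In QEMU this corresponds, roughly, to the logic in cpu_x86_update_cr0() --
quoting from memory, not verbatim:

    if ((env->efer & MSR_EFER_LME) &&
        !(env->cr[0] & CR0_PG_MASK) && (new_cr0 & CR0_PG_MASK)) {
        /* Setting CR0.PG with EFER.LME already set: enter long mode. */
        env->efer |= MSR_EFER_LMA;
        env->hflags |= HF_LMA_MASK;
    }

so LMA can only ever become set while paging is being switched on.)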

The commit 33dfdb5 that you reference is definitely wrong.

> I do not know what ought to be done in the MMU_NESTED_IDX case, and would appreciate your 
> input on this.

It would be ideal if MMU_NESTED_IDX were only used when virtualization is enabled and 
nested_ctl & SVM_NPT_ENABLED, i.e. when use_stage2 is true.

I can't remember why I added a check for use_stage2 for MMU_NESTED_IDX in 98281984a37.  It 
is possible that, as that patch introduced the feature, I was being cautious.  It's also 
possible that I did see something, but then it was cleaned up somewhere in the rest of 
that rather large patch series.

In any case, current intended behaviour is that MMU_NESTED_IDX with !use_stage2 equates to 
MMU_PHYS_IDX.
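
Conceptually (not literal code, just the intended equivalence):

    if (mmu_idx == MMU_NESTED_IDX && !use_stage2) {
        /* No nested paging in effect: treat as a plain physical access. */
        mmu_idx = MMU_PHYS_IDX;
    }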


r~
Michael Brown Dec. 21, 2023, 3:20 p.m. UTC | #4
On 20/12/2023 21:51, Richard Henderson wrote:
> On 12/20/23 22:03, Michael Brown wrote:
>> For the default case, I think it would make sense to unconditionally 
>> truncate the address to 32 bits if paging is disabled.  (I am not sure 
>> why the original commit 33dfdb5 included a test for long mode, since I 
>> do not see how it is possible to get the CPU into long mode with 
>> paging disabled.)
> 
> You are correct that paging is mandatory for LMA -- indeed, setting 
> CR0.PG is the final step in 10.8.5 Initializing IA-32e Mode, which 
> copies EFER.LME to EFER.LMA.
> 
> The commit 33dfdb5 that you reference is definitely wrong.

I have done some further investigation, and come to the conclusion that 
we should just delete the truncation code entirely.

With paging disabled, there is (as far as I know) no way to execute an 
instruction with a 64-bit address size.  Since the address size is therefore 
at most 32 bits, the linear address will already be truncated to 32 bits 
anyway.

A quick userspace test program confirms that on a physical CPU a 32-bit 
address size will result in the linear address being truncated to 32 bits:

    #include <stdint.h>
    #include <stdio.h>

    uint8_t val = 42;

    int main ( void ) {
        uint8_t *ptr = ( &val + 0x80000001 );
        uint8_t res;

        printf ( "&val = %p\n", &val );  // will be around 0x400000
        printf ( "ptr = %p\n", ptr );

        /* 32-bit address size: the effective address wraps modulo 2^32 */
        printf ( "addr32 read via 0x7fffffff(%p)...\n", ptr );
        __asm__ ( "addr32 movb 0x7fffffff(%k1), %b0\n\t"
                  : "=r" ( res ) : "r" ( ptr ) );
        printf ( "...got %d\n", res );

        /* 64-bit address size: no wraparound, so this access faults */
        printf ( "addr64 read via 0x7fffffff(%p)...\n", ptr );
        __asm__ ( "movb 0x7fffffff(%1), %b0\n\t"
                  : "=r" ( res ) : "r" ( ptr ) );
        printf ( "...got %d\n", res );

        return 0;
    }

produces the expected output:

$ cc -o test test.c && ./test
&val = 0x40400c
ptr = 0x8040400d
addr32 read via 0x7fffffff(0x8040400d)...
...got 42
addr64 read via 0x7fffffff(0x8040400d)...
Segmentation fault (core dumped)

which I believe shows that the addr32 instruction experiences wraparound 
of the linear address (and so accesses the correct location), while the 
addr64 instruction with the same base and offset does not experience 
wraparound of the linear address (and so segfaults).

The original commit 33dfdb5 describes 32-bit code relying on the linear 
address wraparound to access memory below the segment base.  This is 
what iPXE also does: the iPXE code is physically copied to somewhere 
high in 32-bit memory, the segment base is set to point to the location 
of iPXE (thereby avoiding the need to apply relocation records), and 
wraparound is relied upon to access all memory below this.  I have 
tested removing the truncation code from get_physical_address() and 
verified that iPXE continues to work as expected.
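
(As a made-up concrete example of the trick: with the segment base at, say, 
0xfff00000, a 32-bit access at offset 0x00200000 produces a linear address of 
(0xfff00000 + 0x00200000) & 0xffffffff = 0x00100000, i.e. it wraps around to 
memory well below the segment base.)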

I have also verified that removing the truncation code from 
get_physical_address() does seem to fix the problem with PAE mode (i.e. 
is able to boot Windows 10 with RAM above 4G).

I will send a v2 patch shortly.

Thanks,

Michael

Patch

diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 5b86f439ad..3d0d0d78d7 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -582,12 +582,10 @@  static bool get_physical_address(CPUX86State *env, vaddr addr,
 
     /* Translation disabled. */
     out->paddr = addr & x86_get_a20_mask(env);
-#ifdef TARGET_X86_64
-    if (!(env->hflags & HF_LMA_MASK)) {
-        /* Without long mode we can only address 32bits in real mode */
+    if (!(env->cr[4] & CR4_PAE_MASK)) {
+        /* Without PAE we can address only 32 bits */
         out->paddr = (uint32_t)out->paddr;
     }
-#endif
     out->prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
     out->page_size = TARGET_PAGE_SIZE;
     return true;