diff mbox series

[qemu] x86: don't let decompressed kernel image clobber setup_data

Message ID 20221228143831.396245-1-Jason@zx2c4.com (mailing list archive)
State New, archived
Headers show
Series [qemu] x86: don't let decompressed kernel image clobber setup_data | expand

Commit Message

Jason A. Donenfeld Dec. 28, 2022, 2:38 p.m. UTC
The setup_data links are appended to the compressed kernel image. Since
the kernel image is typically loaded at 0x100000, setup_data lives at
`0x100000 + compressed_size`, which does not get relocated during the
kernel's boot process.

The kernel typically decompresses the image starting at address
0x1000000 (note: there's one more zero there than the decompressed image
above). This usually is fine for most kernels.

However, if the compressed image is actually quite large, then
setup_data will live at a `0x100000 + compressed_size` that extends into
the decompressed zone at 0x1000000. In other words, if compressed_size
is larger than `0x1000000 - 0x100000`, then the decompression step will
clobber setup_data, resulting in crashes.

Fix this by detecting that possibility, and if it occurs, put setup_data
*after* the end of the decompressed kernel, so that it doesn't get
clobbered.

One caveat is that this only works for images less than around 64
megabytes, so just bail out in that case. This is unfortunate, but I
don't currently have a way of fixing it.

Cc: x86@kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

Comments

Philippe Mathieu-Daudé Dec. 28, 2022, 4:02 p.m. UTC | #1
Hi Jason,

On 28/12/22 15:38, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the decompressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Fix this by detecting that possibility, and if it occurs, put setup_data
> *after* the end of the decompressed kernel, so that it doesn't get
> clobbered.
> 
> One caveat is that this only works for images less than around 64
> megabytes, so just bail out in that case. This is unfortunate, but I
> don't currently have a way of fixing it.
> 
> Cc: x86@kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>   hw/i386/x86.c | 30 ++++++++++++++++++++++++++++++
>   1 file changed, 30 insertions(+)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 78cc131926..628fd2b2e9 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1077,6 +1077,36 @@ void x86_load_linux(X86MachineState *x86ms,
>       }
>       fclose(f);
>   
> +    /* If a setup_data is going to be used, make sure that the bootloader won't
> +     * decompress into it and clobber those bytes. */
> +    if (dtb_filename || !legacy_no_rng_seed) {
> +        uint32_t payload_offset = ldl_p(setup + 0x248);
> +        uint32_t payload_length = ldl_p(setup + 0x24c);
> +        uint32_t target_address = ldl_p(setup + 0x258);

Nitpicking, can the Linux kernel add these magic values in
arch/x86/include/uapi/asm/bootparam.h? Or can we use
offsetof(setup_header) to get them?

> +        uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4);
> +
> +        uint32_t estimated_setup_data_length = 4096 * 16;
> +        uint32_t start_setup_data = prot_addr + kernel_size;
> +        uint32_t end_setup_data = start_setup_data + estimated_setup_data_length;
> +        uint32_t start_target = target_address;
> +        uint32_t end_target = target_address + decompressed_length;

Maybe we can simply use 'unsigned' type.

> +        if ((start_setup_data >= start_target && start_setup_data < end_target) ||
> +            (end_setup_data >= start_target && end_setup_data < end_target)) {
> +            uint32_t padded_size = target_address + decompressed_length - prot_addr;
> +
> +            /* The early stage can't address past around 64 MB from the original
> +             * mapping, so just give up in that case. */
> +            if (padded_size < 62 * 1024 * 1024)

You mention 64 but check for 62, is that expected? You can use the MiB
definitions to ease code review: 64 * MiB.

> +                kernel_size = padded_size;
> +            else {
> +                fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
> +                dtb_filename = NULL;
> +                legacy_no_rng_seed = true;
> +            }
> +        }
> +    }
Fix looks good, glad you figured out the problem.

Regards,

Phil.
Jason A. Donenfeld Dec. 28, 2022, 4:30 p.m. UTC | #2
On Wed, Dec 28, 2022 at 05:02:22PM +0100, Philippe Mathieu-Daudé wrote:
> Hi Jason,
> 
> On 28/12/22 15:38, Jason A. Donenfeld wrote:
> > The setup_data links are appended to the compressed kernel image. Since
> > the kernel image is typically loaded at 0x100000, setup_data lives at
> > `0x100000 + compressed_size`, which does not get relocated during the
> > kernel's boot process.
> > 
> > The kernel typically decompresses the image starting at address
> > 0x1000000 (note: there's one more zero there than the decompressed image
*compressed image

> > +        uint32_t target_address = ldl_p(setup + 0x258);
> 
> Nitpicking, can the Linux kernel add these magic values in
> arch/x86/include/uapi/asm/bootparam.h? Or can we use
> offsetof(setup_header) to get them?

I suspect the reason that x86.c has always had those hardcoded offsets
is because this is how it's documented in Documentation/x86/boot.rst?

> > +        if ((start_setup_data >= start_target && start_setup_data < end_target) ||
> > +            (end_setup_data >= start_target && end_setup_data < end_target)) {
> > +            uint32_t padded_size = target_address + decompressed_length - prot_addr;
> > +
> > +            /* The early stage can't address past around 64 MB from the original
> > +             * mapping, so just give up in that case. */
> > +            if (padded_size < 62 * 1024 * 1024)
> 
> You mention 64 but check for 62, is that expected? You can use the MiB
> definitions to ease code review: 64 * MiB.

62 is intentional. But I'm still not really sure what's up. 63 doesn't
work. I haven't totally worked out why this is, or why the 64 MiB limit
exists in the first place. Maybe because this is a very early mapping
set up by real mode? Or because another mapping is placed over it that's
executable? There's that 2MiB*4096 gdt entry, but that'd cover all 4
gigs. So I really don't know yet. I'll continue to poke at it, but on
the off chance somebody here understands what's up, that'd save me a
bunch of head scratching.

> Fix looks good, glad you figured out the problem.

I mean, kind of. The solution here sucks, especially given that in the
worst case, setup_data just gets dropped. I'm half inclined to consider
this a kernel bug instead, and add some code to relocate setup_data
prior to decompression, and then fix up all the links. It seems like
this would be a lot more robust.

I just wish the people who wrote this stuff would chime in. I've had
x86@kernel.org CC'd but so far, no input from them.

Jason
Jason A. Donenfeld Dec. 28, 2022, 4:57 p.m. UTC | #3
HELLO H. PETER ANVIN,
E
L
L
O

On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
> > Fix looks good, glad you figured out the problem.
> 
> I mean, kind of. The solution here sucks, especially given that in the
> worst case, setup_data just gets dropped. I'm half inclined to consider
> this a kernel bug instead, and add some code to relocate setup_data
> prior to decompression, and then fix up all the links. It seems like
> this would be a lot more robust.
> 
> I just wish the people who wrote this stuff would chime in. I've had
> x86@kernel.org CC'd but so far, no input from them.

Apparently you are the x86 boot guru. What do you want to happen here?
Your input would be very instrumental.

Jason
H. Peter Anvin Dec. 28, 2022, 11:58 p.m. UTC | #4
On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>HELLO H. PETER ANVIN,
>E
>L
>L
>O
>
>On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
>> > Fix looks good, glad you figured out the problem.
>> 
>> I mean, kind of. The solution here sucks, especially given that in the
>> worst case, setup_data just gets dropped. I'm half inclined to consider
>> this a kernel bug instead, and add some code to relocate setup_data
>> prior to decompression, and then fix up all the links. It seems like
>> this would be a lot more robust.
>> 
>> I just wish the people who wrote this stuff would chime in. I've had
>> x86@kernel.org CC'd but so far, no input from them.
>
>Apparently you are the x86 boot guru. What do you want to happen here?
>Your input would be very instrumental.
>
>Jason

Hi!

Glad you asked.

So the kernel load addresses are parameterized in the kernel image setup header. One of the things that are so parameterized are the size and possible realignment of the kernel image in memory.

I'm very confused where you are getting the 64 MB number from. There should not be any such limitation.

In general, setup_data should be able to go anywhere the initrd can go, and so is subject to the same address cap (896 MB for old kernels, 4 GB on newer ones; this address too is enumerated in the header.)

If you want to put setup_data above 4 GB, it *should* be ok if and only if the kernel supports loading the initrd high, too (again, enumerated in the header.

TL;DR: put setup_data where you put the initrd (before or after doesn't matter.)

To be maximally conservative, link the setup_data list in order from lowest to highest address; currently there is no such item of relevance, but in the future there may be setup_data items needed by the BIOS part of the bootstrap in which case they would have to be < 1 MB and precede any items > 1 MB for obvious reasons. That being said, with BIOS dying it is not all that likely that such entries will ever be needed.
H. Peter Anvin Dec. 29, 2022, 2:13 a.m. UTC | #5
On 12/28/22 15:58, H. Peter Anvin wrote:
> On December 28, 2022 8:57:54 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>> HELLO H. PETER ANVIN,
>> E
>> L
>> L
>> O
>>
>> On Wed, Dec 28, 2022 at 05:30:30PM +0100, Jason A. Donenfeld wrote:
>>>> Fix looks good, glad you figured out the problem.
>>>
>>> I mean, kind of. The solution here sucks, especially given that in the
>>> worst case, setup_data just gets dropped. I'm half inclined to consider
>>> this a kernel bug instead, and add some code to relocate setup_data
>>> prior to decompression, and then fix up all the links. It seems like
>>> this would be a lot more robust.
>>>
>>> I just wish the people who wrote this stuff would chime in. I've had
>>> x86@kernel.org CC'd but so far, no input from them.
>>
>> Apparently you are the x86 boot guru. What do you want to happen here?
>> Your input would be very instrumental.
>>
>> Jason
> 
> Hi!
> 
> Glad you asked.
> 
> So the kernel load addresses are parameterized in the kernel image
> setup header. One of the things that are so parameterized are the
> size and possible realignment of the kernel image in memory.
> 
> I'm very confused where you are getting the 64 MB number from. There
> should not be any such limitation.
> 
> In general, setup_data should be able to go anywhere the initrd can
> go, and so is subject to the same address cap (896 MB for old
> kernels, 4 GB on newer ones; this address too is enumerated in the
> header.)
> 
> If you want to put setup_data above 4 GB, it *should* be ok if and
> only if the kernel supports loading the initrd high, too (again,
> enumerated in the header.
> 
> TL;DR: put setup_data where you put the initrd (before or after
> doesn't matter.)
> 
> To be maximally conservative, link the setup_data list in order from
> lowest to highest address; currently there is no such item of
> relevance, but in the future there may be setup_data items needed by
> the BIOS part of the bootstrap in which case they would have to be <
> 1 MB and precede any items > 1 MB for obvious reasons. That being
> said, with BIOS dying it is not all that likely that such entries
> will ever be needed.
> 

So let me try for an algorithm. Attached as a text file to avoid line 
break damage.

	-hpa
Here is an attempted description with pseudo-C code:

First of all, take a 4K page of memory and *initialize it to zero*.
{
    #include <asm/bootparam.h>	/* From the uapi kernel sources */

    /* Allocated somewhere in your code... */
    extern unsigned char *kernel_image;		/* Kernel file */
    extern struct boot_params *boot_params;	/* 4K buffer */
    extern uint32_t kernel_image_size;		/* Size of kernel file */

    /* Callbacks into your code */
    extern bool is_bios_boot(void);
    extern uint32_t end_of_low_memory(void); /* For BIOS boot */
    /*
     * This MUST return an alignment address between start_address
     * and max_address...
     */
    extern uint64_t maybe_relocate_kernel(uint64_t start_address,
	  uint64_t max_address, uint32_t alignment);

    /*
     * Convenience pointer into the kernel image; modifications
     * done here should be reflected in the loaded kernel image
     */
    struct setup_header * const kernel_setup_header =
	(struct setup_header *)(kernel_image + 0x1f1);

    /* Initialize boot_params to zero!!! */
    memset(boot_params, 0, sizeof *boot_params);
}

Copy the setup header starting at file offset 0x1f1 to offset 0x1f1
into that page:
{
    int setup_length =
	kernel_setup_header->header == 0x53726448
	? (kernel_setup_header->jump >> 8) + 17 : 15;

    memcpy(&boot_params->hdr, kernel_setup_header, setup_length);
}

Now you can compute values including ones are omitted by older kernels:
{
    /*
     * Split between the part of the kernel to be loaded into
     * low memory (for 16-bit boot, otherwise it can be safely
     * omitted) and the part to be loaded into high memory.
     */
    if (!boot_params->hdr.setup_sects)
	boot_param->hdr.setup_sects = 4;

    int high_kernel_start = (boot_param->hdr.setup_sects+1) << 9;

    /*
     * Highest permitted address for the high part of the kernel image,
     * initrd, command line (*except for 16-bit boot*), and setup_data
     *
     * max_initrd_addr here is exclusive
     */
    uint64_t max_initrd_addr = (uint64_t)boot_params->hdr.initrd_addr_max + 1;
    if (boot_params->hdr.version < 0x0200)
	max_initrd_addr = 0;	/* No initrd supported */
    else if (boot_params->hdr.version < 0x0203)
	max_initrd_addr = 0x38000000;
    else if (boot_params->hdr.version >= 0x020c &&
	       (boot_params->hdr.xloadflags & XLF_CAN_BE_LOADED_ABOVE_4G))
	max_initrd_addr = (uint64_t)1 << 52; /* Architecture-imposed limit */

    /*
     * Maximum command line size *including terminating null*
     */
    unsigned int cmdline_size;
    if (boot_params->hdr.version < 0x0200)
	cmdline_size = 0;	/* No command line supported */
    else if (boot_params->hdr.version < 0x0206)
	boot_params->hdr.cmdline_size = 256;
    else
	boot_params->hdr.cmdline_size + 1;

    /* Command line size including terminating null */

    /*
     * Load addresses for the low and high kernels, respectively
     */
    uint32_t low_kernel_address;
    uint64_t cmdline_addr;	/* Address to load the command line */

    if (is_bios_boot()) {
	if (!(boot_params->hdr.loadflags & LOADED_HIGH)) {
	    low_kernel_address = 0x90000;
	} else {
	    /*
	     * Recommended to be the lowest available address between
	     * 0x10000 and 0x90000
	     */
	    low_kernel_address = preferred_low_kernel_address();
	}

	uint32_t lowkernel_max;

	lowkernel_max = low_kernel_address + 0x10000;
	if (boot_params.hdr.version >= 0x0202)
	    lowkernel_max += (cmdline_size + 15) & ~15;

	/*
	 * end_of_low_memory() is usually given by *(uint8_t *)0x413 << 10
	 */
	if (lowkernel_max > end_of_low_memory())
	    lowkernel_max = end_of_low_memory();

	cmdline_addr = (lowkernel_max - cmdline_size) & ~15;
	if (boot_params->hdr.version >= 0x0202)
	    kernel_setup_header->cmd_line_ptr = cmdline_addr;
	else if (boot_params->hdr.version >= 0x0200)
	    kernel_setup_header->setup_move_size =
		lowkernel_max - low_kernel_address;

	if (boot_params.hdr.version >= 0x0201) {
	    kernel_setup_header->heap_end_ptr
		= cmdline_addr - low_kernel_address - 0x0200;
	    kernel_setup_header->loadflags |= CAN_USE_HEAP;
	}
    } else {
	low_kernel_address = 0;	/* Not used for non-BIOS boot */
	cmdline_addr = 0;	/* Not assigned yet */
    }

    /*
     * Default load address for the high kernel, and if it can be relocated
     */
    uint64_t high_kernel_address;
    uint32_t high_kernel_size;	/* The amount of memory the high kernel needs */
    bool relocatable_kernel = false;
    uint32_t high_kernel_alignment = 0x400000; /* Kernel runtime alignment */

    if (!(boot_params->hdr.loadflags & LOADED_HIGH)) {
	high_kernel_address = 0x10000;
    } else {
	if (boot_params->hdr.version >= 0x020a)
	    high_kernel_address = boot_params->hdr.pref_address;
	else
	    high_kernel_address = 0x100000;

	if (boot_params->hdr.version >= 0x0205 &&
	    boot_params->hdr.relocatable_kernel) {
	    relocatable_kernel = true;
	    high_kernel_alignment = boot_params->hdr.kernel_alignment;
	}
    }

    /*
     * Linear memory area needed by the kernel
     */
    uint32_t kernel_mem_size;
    if (boot_params->hdr.version >= 0x020a)
	kernel_mem_size = boot_params->hdr.init_size;
    else
	kernel_mem_size = kernel_image_size << 2; /* Pure guesswork... */

    /* Relocate the kernel load address if desired */
    if (relocatable_kernel) {
	high_kernel_address =
	    maybe_relocate_kernel(high_kernel_address,
				  max_initrd_addr - kernel_mem_size,
				  high_kernel_aligment);
    }

    /* Adjust for possible internal kernel realigment */
    kernel_mem_size += (-high_kernel_address) & (high_kernel_alignment - 1);

    /*
     * Determine the minimum safe address for loading initrd, setup_data,
     * and, if cmdline_addr == 0 (i.e. !is_bios_boot()), the command line.
     */
    uint64_t min_initrd_addr = high_kernel_address + kernel_mem_size;
}
Jason A. Donenfeld Dec. 29, 2022, 2:31 a.m. UTC | #6
Hi,

Read this message in a fixed width text editor with a lot of columns.

On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
> Glad you asked.
> 
> So the kernel load addresses are parameterized in the kernel image
> setup header. One of the things that are so parameterized are the size
> and possible realignment of the kernel image in memory.
> 
> I'm very confused where you are getting the 64 MB number from. There
> should not be any such limitation.

Currently, QEMU appends it to the kernel image, not to the initramfs as
you suggest below. So, that winds up looking, currently, like:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

The problem is that this decompresses to 0x1000000 (one more zero). So
if l1 is > (0x1000000-0x100000), then this winds up looking like:

          kernel image            setup_data
   |--------------------------||----------------|
0x100000                  0x100000+l1     0x100000+l1+l2

                                 d e c o m p r e s s e d   k e r n e l
		     |-------------------------------------------------------------|
                0x1000000                                                     0x1000000+l3 

The decompressed kernel seemingly overwriting the compressed kernel
image isn't a problem, because that gets relocated to a higher address
early on in the boot process. setup_data, however, stays in the same
place, since those links are self referential and nothing fixes them up.
So the decompressed kernel clobbers it.

The solution in this commit adds a bunch of padding between the kernel
image and setup_data to avoid this. That looks like this:

          kernel image                            padding                               setup_data
   |--------------------------||---------------------------------------------------||----------------|
0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2

                                 d e c o m p r e s s e d   k e r n e l
		     |-------------------------------------------------------------|
                0x1000000                                                     0x1000000+l3 

This way, the decompressed kernel doesn't clobber setup_data.

The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
then the bootloader crashes when trying to dereference setup_data's
->len param at the end of initialize_identity_maps() in ident_map_64.c.
I don't know why it does this. If I could remove the 62 megabyte
restriction, then I could keep with this technique and all would be
well.

> In general, setup_data should be able to go anywhere the initrd can
> go, and so is subject to the same address cap (896 MB for old kernels,
> 4 GB on newer ones; this address too is enumerated in the header.)

It would be theoretically possible to attach it to the initrd image
instead of to the kernel image. As a last resort, I guess I can look
into doing that. However, that's going to require some serious rework
and plumbing of a lot of different components. So if I can make it work
as is, that'd be ideal. However, I need to figure out this weird 62 meg
limitation.

Any ideas on that?

Jason
Philippe Mathieu-Daudé Dec. 29, 2022, 7:28 a.m. UTC | #7
On 29/12/22 03:31, Jason A. Donenfeld wrote:
> Hi,
> 
> Read this message in a fixed width text editor with a lot of columns.
> 
> On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>>
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>>
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.

[...]

Thanks for the diagrams. Feel free to include them in the commit
description ;)

>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
> 
> It would be theoretically possible to attach it to the initrd image
> instead of to the kernel image. As a last resort, I guess I can look
> into doing that. However, that's going to require some serious rework
> and plumbing of a lot of different components. So if I can make it work
> as is, that'd be ideal. However, I need to figure out this weird 62 meg
> limitation.
> 
> Any ideas on that?

Could it be a limitation (internal buffer) of the decompressor?
H. Peter Anvin Dec. 29, 2022, 7:30 a.m. UTC | #8
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>Hi,
>
>Read this message in a fixed width text editor with a lot of columns.
>
>On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>> 
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>> 
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.
>
>Currently, QEMU appends it to the kernel image, not to the initramfs as
>you suggest below. So, that winds up looking, currently, like:
>
>          kernel image            setup_data
>   |--------------------------||----------------|
>0x100000                  0x100000+l1     0x100000+l1+l2
>
>The problem is that this decompresses to 0x1000000 (one more zero). So
>if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
>          kernel image            setup_data
>   |--------------------------||----------------|
>0x100000                  0x100000+l1     0x100000+l1+l2
>
>                                 d e c o m p r e s s e d   k e r n e l
>		     |-------------------------------------------------------------|
>                0x1000000                                                     0x1000000+l3 
>
>The decompressed kernel seemingly overwriting the compressed kernel
>image isn't a problem, because that gets relocated to a higher address
>early on in the boot process. setup_data, however, stays in the same
>place, since those links are self referential and nothing fixes them up.
>So the decompressed kernel clobbers it.
>
>The solution in this commit adds a bunch of padding between the kernel
>image and setup_data to avoid this. That looks like this:
>
>          kernel image                            padding                               setup_data
>   |--------------------------||---------------------------------------------------||----------------|
>0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2
>
>                                 d e c o m p r e s s e d   k e r n e l
>		     |-------------------------------------------------------------|
>                0x1000000                                                     0x1000000+l3 
>
>This way, the decompressed kernel doesn't clobber setup_data.
>
>The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>then the bootloader crashes when trying to dereference setup_data's
>->len param at the end of initialize_identity_maps() in ident_map_64.c.
>I don't know why it does this. If I could remove the 62 megabyte
>restriction, then I could keep with this technique and all would be
>well.
>
>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
>
>It would be theoretically possible to attach it to the initrd image
>instead of to the kernel image. As a last resort, I guess I can look
>into doing that. However, that's going to require some serious rework
>and plumbing of a lot of different components. So if I can make it work
>as is, that'd be ideal. However, I need to figure out this weird 62 meg
>limitation.
>
>Any ideas on that?
>
>Jason

Ok, the code I sent will figure out the minimum amount of padding that you need (min_initrd_addr) as well.
H. Peter Anvin Dec. 29, 2022, 7:31 a.m. UTC | #9
On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>Hi,
>
>Read this message in a fixed width text editor with a lot of columns.
>
>On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> Glad you asked.
>> 
>> So the kernel load addresses are parameterized in the kernel image
>> setup header. One of the things that are so parameterized are the size
>> and possible realignment of the kernel image in memory.
>> 
>> I'm very confused where you are getting the 64 MB number from. There
>> should not be any such limitation.
>
>Currently, QEMU appends it to the kernel image, not to the initramfs as
>you suggest below. So, that winds up looking, currently, like:
>
>          kernel image            setup_data
>   |--------------------------||----------------|
>0x100000                  0x100000+l1     0x100000+l1+l2
>
>The problem is that this decompresses to 0x1000000 (one more zero). So
>if l1 is > (0x1000000-0x100000), then this winds up looking like:
>
>          kernel image            setup_data
>   |--------------------------||----------------|
>0x100000                  0x100000+l1     0x100000+l1+l2
>
>                                 d e c o m p r e s s e d   k e r n e l
>		     |-------------------------------------------------------------|
>                0x1000000                                                     0x1000000+l3 
>
>The decompressed kernel seemingly overwriting the compressed kernel
>image isn't a problem, because that gets relocated to a higher address
>early on in the boot process. setup_data, however, stays in the same
>place, since those links are self referential and nothing fixes them up.
>So the decompressed kernel clobbers it.
>
>The solution in this commit adds a bunch of padding between the kernel
>image and setup_data to avoid this. That looks like this:
>
>          kernel image                            padding                               setup_data
>   |--------------------------||---------------------------------------------------||----------------|
>0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2
>
>                                 d e c o m p r e s s e d   k e r n e l
>		     |-------------------------------------------------------------|
>                0x1000000                                                     0x1000000+l3 
>
>This way, the decompressed kernel doesn't clobber setup_data.
>
>The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>then the bootloader crashes when trying to dereference setup_data's
>->len param at the end of initialize_identity_maps() in ident_map_64.c.
>I don't know why it does this. If I could remove the 62 megabyte
>restriction, then I could keep with this technique and all would be
>well.
>
>> In general, setup_data should be able to go anywhere the initrd can
>> go, and so is subject to the same address cap (896 MB for old kernels,
>> 4 GB on newer ones; this address too is enumerated in the header.)
>
>It would be theoretically possible to attach it to the initrd image
>instead of to the kernel image. As a last resort, I guess I can look
>into doing that. However, that's going to require some serious rework
>and plumbing of a lot of different components. So if I can make it work
>as is, that'd be ideal. However, I need to figure out this weird 62 meg
>limitation.
>
>Any ideas on that?
>
>Jason

As far as a crash... that sounds like a big and a pretty serious one at that.

Could you let me know what kernel you are using and how *exactly* you are booting it?
Borislav Petkov Dec. 29, 2022, 12:47 p.m. UTC | #10
On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> As far as a crash... that sounds like a big and a pretty serious one at that.
> 
> Could you let me know what kernel you are using and how *exactly* you are booting it?

Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:

early console in extract_kernel
input_data: 0x000000000be073a8
input_len: 0x00000000008cfc43
output: 0x0000000001000000
output_len: 0x000000000b600a98
kernel_total_size: 0x000000000ac26000
needed_size: 0x000000000b800000
trampoline_32bit: 0x000000000009d000

so that's a ~9M kernel which gets decompressed at 0x1000000 and the
output len is, what, ~180M which looks like plenty to me...
Jason A. Donenfeld Dec. 30, 2022, 3:54 p.m. UTC | #11
On Thu, Dec 29, 2022 at 01:47:49PM +0100, Borislav Petkov wrote:
> On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> > As far as a crash... that sounds like a big and a pretty serious one at that.
> > 
> > Could you let me know what kernel you are using and how *exactly* you are booting it?
> 
> Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> 
> early console in extract_kernel
> input_data: 0x000000000be073a8
> input_len: 0x00000000008cfc43
> output: 0x0000000001000000
> output_len: 0x000000000b600a98
> kernel_total_size: 0x000000000ac26000
> needed_size: 0x000000000b800000
> trampoline_32bit: 0x000000000009d000
> 
> so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> output len is, what, ~180M which looks like plenty to me...

I think you might have misunderstood the thread. First, to reproduce the
bug that this patch fixes, you need a kernel with a compressed size of
around 16 megs, not 9. Secondly, that crash is well understood and
doesn't need to be reproduced; this patch fixes it. Rather, the question
now is how to improve this patch to remove the 62 meg limit. I'll follow
up with hpa's request for reproduction info.

Jason
Jason A. Donenfeld Dec. 30, 2022, 3:59 p.m. UTC | #12
Hi,

On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
> >Hi,
> >
> >Read this message in a fixed width text editor with a lot of columns.
> >
> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
> >> Glad you asked.
> >> 
> >> So the kernel load addresses are parameterized in the kernel image
> >> setup header. One of the things that are so parameterized are the size
> >> and possible realignment of the kernel image in memory.
> >> 
> >> I'm very confused where you are getting the 64 MB number from. There
> >> should not be any such limitation.
> >
> >Currently, QEMU appends it to the kernel image, not to the initramfs as
> >you suggest below. So, that winds up looking, currently, like:
> >
> >          kernel image            setup_data
> >   |--------------------------||----------------|
> >0x100000                  0x100000+l1     0x100000+l1+l2
> >
> >The problem is that this decompresses to 0x1000000 (one more zero). So
> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
> >
> >          kernel image            setup_data
> >   |--------------------------||----------------|
> >0x100000                  0x100000+l1     0x100000+l1+l2
> >
> >                                 d e c o m p r e s s e d   k e r n e l
> >		     |-------------------------------------------------------------|
> >                0x1000000                                                     0x1000000+l3 
> >
> >The decompressed kernel seemingly overwriting the compressed kernel
> >image isn't a problem, because that gets relocated to a higher address
> >early on in the boot process. setup_data, however, stays in the same
> >place, since those links are self referential and nothing fixes them up.
> >So the decompressed kernel clobbers it.
> >
> >The solution in this commit adds a bunch of padding between the kernel
> >image and setup_data to avoid this. That looks like this:
> >
> >          kernel image                            padding                               setup_data
> >   |--------------------------||---------------------------------------------------||----------------|
> >0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2
> >
> >                                 d e c o m p r e s s e d   k e r n e l
> >		     |-------------------------------------------------------------|
> >                0x1000000                                                     0x1000000+l3 
> >
> >This way, the decompressed kernel doesn't clobber setup_data.
> >
> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
> >then the bootloader crashes when trying to dereference setup_data's
> >->len param at the end of initialize_identity_maps() in ident_map_64.c.
> >I don't know why it does this. If I could remove the 62 megabyte
> >restriction, then I could keep with this technique and all would be
> >well.
> >
> >> In general, setup_data should be able to go anywhere the initrd can
> >> go, and so is subject to the same address cap (896 MB for old kernels,
> >> 4 GB on newer ones; this address too is enumerated in the header.)
> >
> >It would be theoretically possible to attach it to the initrd image
> >instead of to the kernel image. As a last resort, I guess I can look
> >into doing that. However, that's going to require some serious rework
> >and plumbing of a lot of different components. So if I can make it work
> >as is, that'd be ideal. However, I need to figure out this weird 62 meg
> >limitation.
> >
> >Any ideas on that?
> >
> >Jason
> 
> As far as a crash... that sounds like a big and a pretty serious one at that.
> 
> Could you let me know what kernel you are using and how *exactly* you are booting it?

I'll attach a .config file. Apply the patch at the top of this thread to
qemu, except make one modification:

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 628fd2b2e9..a61ee23e13 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms,
 
             /* The early stage can't address past around 64 MB from the original
              * mapping, so just give up in that case. */
-            if (padded_size < 62 * 1024 * 1024)
+            if (true || padded_size < 62 * 1024 * 1024)
                 kernel_size = padded_size;
             else {
                 fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");

Then build qemu. Run it with `-kernel bzImage`, based on the kernel
built with the .config I attached.

You'll see that the CPU triple faults when hitting this line:

        sd = (struct setup_data *)boot_params->hdr.setup_data;
        while (sd) {
                unsigned long sd_addr = (unsigned long)sd;

                kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);  <----
                sd = (struct setup_data *)sd->next;
        }

, because it dereferences *sd. This does not happen if the decompressed
size of the kernel is < 62 megs.

So that's the "big and pretty serious" bug that might be worthy of
investigation.

Jason
Jason A. Donenfeld Dec. 30, 2022, 4:21 p.m. UTC | #13
Er, .config attached now.
Borislav Petkov Dec. 30, 2022, 5:01 p.m. UTC | #14
On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote:
> > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> > 
> > early console in extract_kernel
> > input_data: 0x000000000be073a8
> > input_len: 0x00000000008cfc43
> > output: 0x0000000001000000
> > output_len: 0x000000000b600a98
> > kernel_total_size: 0x000000000ac26000
> > needed_size: 0x000000000b800000
> > trampoline_32bit: 0x000000000009d000
> > 
> > so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> > output len is, what, ~180M which looks like plenty to me...
> 
> I think you might have misunderstood the thread.

Maybe...

I've been trying to make sense of all the decompression dancing we're doing and
the diagrams you've drawn and there's a comment over extract_kernel() which
basically says that the compressed image is at the back of the memory buffer

input_data: 0x000000000be073a8

and we decompress to a smaller address

output: 0x0000000001000000

And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when
we start decompressing, we overwrite it.

I guess extract_kernel() could also dump the setup_data addresses so that we can
verify that...

> First, to reproduce the bug that this patch fixes, you need a kernel with a
> compressed size of around 16 megs, not 9.

Not only that - you need setup_data to be placed somewhere just so that it gets
overwritten during decompression. Also, see below.
 
> Secondly, that crash is well understood and doesn't need to be reproduced;

Is it?

Where do you get the 0x100000 as the starting address of the kernel image?

Because input_data above says that the input address is somewhere higher...

Btw, here's an allyesconfig:

early console in setup code
No EFI environment detected.
early console in extract_kernel
input_data: 0x000000001bd003a8
input_len: 0x000000000cde2963
output: 0x0000000001000000
output_len: 0x0000000027849d74
kernel_total_size: 0x0000000025830000
needed_size: 0x0000000027a00000
trampoline_32bit: 0x000000000009d000
Physical KASLR using RDRAND RDTSC...
Virtual KASLR using RDRAND RDTSC...

That kernel has compressed size of ~205M and the output buffer is of size ~632M.

> this patch fixes it. Rather, the question now is how to improve this patch
> to remove the 62 meg limit.

Yeah, I'm wondering what that 62M limit is too, actually.

But maybe I'm misunderstanding...
Jason A. Donenfeld Dec. 30, 2022, 5:07 p.m. UTC | #15
On Fri, Dec 30, 2022 at 6:01 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Dec 30, 2022 at 04:54:27PM +0100, Jason A. Donenfeld wrote:
> > > Right, with CONFIG_X86_VERBOSE_BOOTUP=y in a guest here, it says:
> > >
> > > early console in extract_kernel
> > > input_data: 0x000000000be073a8
> > > input_len: 0x00000000008cfc43
> > > output: 0x0000000001000000
> > > output_len: 0x000000000b600a98
> > > kernel_total_size: 0x000000000ac26000
> > > needed_size: 0x000000000b800000
> > > trampoline_32bit: 0x000000000009d000
> > >
> > > so that's a ~9M kernel which gets decompressed at 0x1000000 and the
> > > output len is, what, ~180M which looks like plenty to me...
> >
> > I think you might have misunderstood the thread.
>
> Maybe...
>
> I've been trying to make sense of all the decompression dancing we're doing and
> the diagrams you've drawn and there's a comment over extract_kernel() which
> basically says that the compressed image is at the back of the memory buffer
>
> input_data: 0x000000000be073a8
>
> and we decompress to a smaller address
>
> output: 0x0000000001000000
>
> And, it *looks* like setup_data is at an address somewhere >= 0x1000000 so when
> we start decompressing, we overwrite it.
>
> I guess extract_kernel() could also dump the setup_data addresses so that we can
> verify that...
>
> > First, to reproduce the bug that this patch fixes, you need a kernel with a
> > compressed size of around 16 megs, not 9.
>
> Not only that - you need setup_data to be placed somewhere just so that it gets
> overwritten during decompression. Also, see below.
>
> > Secondly, that crash is well understood and doesn't need to be reproduced;
>
> Is it?
>
> Where do you get the 0x100000 as the starting address of the kernel image?
>
> Because input_data above says that the input address is somewhere higher...

Look closer at the boot process. The compressed image is initially at
0x100000, but it gets relocated to a safer area at the end of
startup_64:

/*
 * Copy the compressed kernel to the end of our buffer
 * where decompression in place becomes safe.
 */
        pushq   %rsi
        leaq    (_bss-8)(%rip), %rsi
        leaq    rva(_bss-8)(%rbx), %rdi
        movl    $(_bss - startup_32), %ecx
        shrl    $3, %ecx
        std
        rep     movsq
        cld
        popq    %rsi

        /*
         * The GDT may get overwritten either during the copy we just did or
         * during extract_kernel below. To avoid any issues, repoint the GDTR
         * to the new copy of the GDT.
         */
        leaq    rva(gdt64)(%rbx), %rax
        leaq    rva(gdt)(%rbx), %rdx
        movq    %rdx, 2(%rax)
        lgdt    (%rax)

/*
 * Jump to the relocated address.
 */
        leaq    rva(.Lrelocated)(%rbx), %rax
        jmp     *%rax

And that address winds up being calculated with a combination of the
image size and the init_size param, so it's safe. Decompression won't
overwrite the compressed kernel.

HOWEVER, qemu currently appends setup_data to the end of the
compressed kernel image, and this part isn't moved, and setup_data
links aren't walked/relocated. So that means the original address
remains, of 0x100000.

Jason
Jason A. Donenfeld Dec. 30, 2022, 6:30 p.m. UTC | #16
On Wed, Dec 28, 2022 at 03:38:30PM +0100, Jason A. Donenfeld wrote:
> The setup_data links are appended to the compressed kernel image. Since
> the kernel image is typically loaded at 0x100000, setup_data lives at
> `0x100000 + compressed_size`, which does not get relocated during the
> kernel's boot process.
> 
> The kernel typically decompresses the image starting at address
> 0x1000000 (note: there's one more zero there than the decompressed image
> above). This usually is fine for most kernels.
> 
> However, if the compressed image is actually quite large, then
> setup_data will live at a `0x100000 + compressed_size` that extends into
> the decompressed zone at 0x1000000. In other words, if compressed_size
> is larger than `0x1000000 - 0x100000`, then the decompression step will
> clobber setup_data, resulting in crashes.
> 
> Fix this by detecting that possibility, and if it occurs, put setup_data
> *after* the end of the decompressed kernel, so that it doesn't get
> clobbered.
> 
> One caveat is that this only works for images less than around 64
> megabytes, so just bail out in that case. This is unfortunate, but I
> don't currently have a way of fixing it.
 
I've got a different solution now that tries to piggy back on cmdline.
I'll send a v2. It avoids the 62MiB crash situation of this one and
seems to work fine.
H. Peter Anvin Dec. 30, 2022, 7:13 p.m. UTC | #17
On December 30, 2022 7:59:30 AM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>Hi,
>
>On Wed, Dec 28, 2022 at 11:31:34PM -0800, H. Peter Anvin wrote:
>> On December 28, 2022 6:31:07 PM PST, "Jason A. Donenfeld" <Jason@zx2c4.com> wrote:
>> >Hi,
>> >
>> >Read this message in a fixed width text editor with a lot of columns.
>> >
>> >On Wed, Dec 28, 2022 at 03:58:12PM -0800, H. Peter Anvin wrote:
>> >> Glad you asked.
>> >> 
>> >> So the kernel load addresses are parameterized in the kernel image
>> >> setup header. One of the things that are so parameterized are the size
>> >> and possible realignment of the kernel image in memory.
>> >> 
>> >> I'm very confused where you are getting the 64 MB number from. There
>> >> should not be any such limitation.
>> >
>> >Currently, QEMU appends it to the kernel image, not to the initramfs as
>> >you suggest below. So, that winds up looking, currently, like:
>> >
>> >          kernel image            setup_data
>> >   |--------------------------||----------------|
>> >0x100000                  0x100000+l1     0x100000+l1+l2
>> >
>> >The problem is that this decompresses to 0x1000000 (one more zero). So
>> >if l1 is > (0x1000000-0x100000), then this winds up looking like:
>> >
>> >          kernel image            setup_data
>> >   |--------------------------||----------------|
>> >0x100000                  0x100000+l1     0x100000+l1+l2
>> >
>> >                                 d e c o m p r e s s e d   k e r n e l
>> >		     |-------------------------------------------------------------|
>> >                0x1000000                                                     0x1000000+l3 
>> >
>> >The decompressed kernel seemingly overwriting the compressed kernel
>> >image isn't a problem, because that gets relocated to a higher address
>> >early on in the boot process. setup_data, however, stays in the same
>> >place, since those links are self referential and nothing fixes them up.
>> >So the decompressed kernel clobbers it.
>> >
>> >The solution in this commit adds a bunch of padding between the kernel
>> >image and setup_data to avoid this. That looks like this:
>> >
>> >          kernel image                            padding                               setup_data
>> >   |--------------------------||---------------------------------------------------||----------------|
>> >0x100000                  0x100000+l1                                         0x1000000+l3      0x1000000+l3+l2
>> >
>> >                                 d e c o m p r e s s e d   k e r n e l
>> >		     |-------------------------------------------------------------|
>> >                0x1000000                                                     0x1000000+l3 
>> >
>> >This way, the decompressed kernel doesn't clobber setup_data.
>> >
>> >The problem is that if 0x1000000+l3-0x100000 is around 62 megabytes,
>> >then the bootloader crashes when trying to dereference setup_data's
>> >->len param at the end of initialize_identity_maps() in ident_map_64.c.
>> >I don't know why it does this. If I could remove the 62 megabyte
>> >restriction, then I could keep with this technique and all would be
>> >well.
>> >
>> >> In general, setup_data should be able to go anywhere the initrd can
>> >> go, and so is subject to the same address cap (896 MB for old kernels,
>> >> 4 GB on newer ones; this address too is enumerated in the header.)
>> >
>> >It would be theoretically possible to attach it to the initrd image
>> >instead of to the kernel image. As a last resort, I guess I can look
>> >into doing that. However, that's going to require some serious rework
>> >and plumbing of a lot of different components. So if I can make it work
>> >as is, that'd be ideal. However, I need to figure out this weird 62 meg
>> >limitation.
>> >
>> >Any ideas on that?
>> >
>> >Jason
>> 
>> As far as a crash... that sounds like a big and a pretty serious one at that.
>> 
>> Could you let me know what kernel you are using and how *exactly* you are booting it?
>
>I'll attach a .config file. Apply the patch at the top of this thread to
>qemu, except make one modification:
>
>diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>index 628fd2b2e9..a61ee23e13 100644
>--- a/hw/i386/x86.c
>+++ b/hw/i386/x86.c
>@@ -1097,7 +1097,7 @@ void x86_load_linux(X86MachineState *x86ms,
> 
>             /* The early stage can't address past around 64 MB from the original
>              * mapping, so just give up in that case. */
>-            if (padded_size < 62 * 1024 * 1024)
>+            if (true || padded_size < 62 * 1024 * 1024)
>                 kernel_size = padded_size;
>             else {
>                 fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
>
>Then build qemu. Run it with `-kernel bzImage`, based on the kernel
>built with the .config I attached.
>
>You'll see that the CPU triple faults when hitting this line:
>
>        sd = (struct setup_data *)boot_params->hdr.setup_data;
>        while (sd) {
>                unsigned long sd_addr = (unsigned long)sd;
>
>                kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);  <----
>                sd = (struct setup_data *)sd->next;
>        }
>
>, because it dereferences *sd. This does not happen if the decompressed
>size of the kernel is < 62 megs.
>
>So that's the "big and pretty serious" bug that might be worthy of
>investigation.
>
>Jason

No kidding. Dereferencing data *before you map it* is generally frowned upon.

This needs to be split into to making calls.

*Facepalm*
Borislav Petkov Dec. 30, 2022, 7:54 p.m. UTC | #18
On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
> Look closer at the boot process. The compressed image is initially at
> 0x100000, but it gets relocated to a safer area at the end of
> startup_64:

That is the address we're executing here from, rip here looks like 0x100xxx.

> /*
>  * Copy the compressed kernel to the end of our buffer
>  * where decompression in place becomes safe.
>  */
>         pushq   %rsi
>         leaq    (_bss-8)(%rip), %rsi
>         leaq    rva(_bss-8)(%rbx), %rdi

when you get to here, it looks something like this:

        leaq    (_bss-8)(%rip), %rsi		# 0x9e7ff8
        leaq    rva(_bss-8)(%rbx), %rdi		# 0xc6eeff8

so the source address is that _bss thing and we copy...

>         movl    $(_bss - startup_32), %ecx
>         shrl    $3, %ecx
>         std

... backwards since DF=1.

Up to:

# rsi = 0xffff8
# rdi = 0xbe06ff8

Ok, so the source address is 0x100000. Good.

> HOWEVER, qemu currently appends setup_data to the end of the
> compressed kernel image,

Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part
which is compressed/head_64.S and which does the above and the relocation etc.

> and this part isn't moved, and setup_data links aren't walked/relocated. So
> that means the original address remains, of 0x100000.

See above: when it starts copying the kernel image backwards to a higher
address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
*after* that address. And that doesn't get copied ofc.

So far, so good.

Now later, we extract the compressed kernel created with the mkpiggy magic:

input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:

by doing

/*
 * Do the extraction, and jump to the new kernel..
 */

        pushq   %rsi                    /* Save the real mode argument */	0x13d00
        movq    %rsi, %rdi              /* real mode address */			0x13d00
        leaq    boot_heap(%rip), %rsi   /* malloc area for uncompression */	0xc6ef000
        leaq    input_data(%rip), %rdx  /* input_data */			0xbe073a8
        movl    input_len(%rip), %ecx   /* input_len */				0x8cfe13
        movq    %rbp, %r8               /* output target address */		0x1000000
        movl    output_len(%rip), %r9d  /* decompressed length, end of relocs */
        call    extract_kernel          /* returns kernel location in %rax */
        popq    %rsi

(actual addresses at the end.)

Now, when you say you triplefault somewhere in initialize_identity_maps() when
trying to access setup_data, then if you look a couple of lines before that call
we do

	call load_stage2_idt

which sets up a boottime #PF handler do_boot_page_fault() and it actually does
call kernel_add_identity_map() so *actually* it should map any unmapped
setup_data addresses.

So why doesn't it do that and why do you triplefault?

Hmmm.
H. Peter Anvin Dec. 30, 2022, 9:58 p.m. UTC | #19
On December 30, 2022 11:54:11 AM PST, Borislav Petkov <bp@alien8.de> wrote:
>On Fri, Dec 30, 2022 at 06:07:24PM +0100, Jason A. Donenfeld wrote:
>> Look closer at the boot process. The compressed image is initially at
>> 0x100000, but it gets relocated to a safer area at the end of
>> startup_64:
>
>That is the address we're executing here from, rip here looks like 0x100xxx.
>
>> /*
>>  * Copy the compressed kernel to the end of our buffer
>>  * where decompression in place becomes safe.
>>  */
>>         pushq   %rsi
>>         leaq    (_bss-8)(%rip), %rsi
>>         leaq    rva(_bss-8)(%rbx), %rdi
>
>when you get to here, it looks something like this:
>
>        leaq    (_bss-8)(%rip), %rsi		# 0x9e7ff8
>        leaq    rva(_bss-8)(%rbx), %rdi		# 0xc6eeff8
>
>so the source address is that _bss thing and we copy...
>
>>         movl    $(_bss - startup_32), %ecx
>>         shrl    $3, %ecx
>>         std
>
>... backwards since DF=1.
>
>Up to:
>
># rsi = 0xffff8
># rdi = 0xbe06ff8
>
>Ok, so the source address is 0x100000. Good.
>
>> HOWEVER, qemu currently appends setup_data to the end of the
>> compressed kernel image,
>
>Yeah, you mean the kernel which starts executing at 0x100000, i.e., that part
>which is compressed/head_64.S and which does the above and the relocation etc.
>
>> and this part isn't moved, and setup_data links aren't walked/relocated. So
>> that means the original address remains, of 0x100000.
>
>See above: when it starts copying the kernel image backwards to a higher
>address, that last byte is at 0x9e7ff8 so I'm guessing qemu has put setup_data
>*after* that address. And that doesn't get copied ofc.
>
>So far, so good.
>
>Now later, we extract the compressed kernel created with the mkpiggy magic:
>
>input_data:
>.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
>input_data_end:
>
>by doing
>
>/*
> * Do the extraction, and jump to the new kernel..
> */
>
>        pushq   %rsi                    /* Save the real mode argument */	0x13d00
>        movq    %rsi, %rdi              /* real mode address */			0x13d00
>        leaq    boot_heap(%rip), %rsi   /* malloc area for uncompression */	0xc6ef000
>        leaq    input_data(%rip), %rdx  /* input_data */			0xbe073a8
>        movl    input_len(%rip), %ecx   /* input_len */				0x8cfe13
>        movq    %rbp, %r8               /* output target address */		0x1000000
>        movl    output_len(%rip), %r9d  /* decompressed length, end of relocs */
>        call    extract_kernel          /* returns kernel location in %rax */
>        popq    %rsi
>
>(actual addresses at the end.)
>
>Now, when you say you triplefault somewhere in initialize_identity_maps() when
>trying to access setup_data, then if you look a couple of lines before that call
>we do
>
>	call load_stage2_idt
>
>which sets up a boottime #PF handler do_boot_page_fault() and it actually does
>call kernel_add_identity_map() so *actually* it should map any unmapped
>setup_data addresses.
>
>So why doesn't it do that and why do you triplefault?
>
>Hmmm.
>

See the other thread fork. They have identified the problem already.
Jason A. Donenfeld Dec. 30, 2022, 10:10 p.m. UTC | #20
On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
> See the other thread fork. They have identified the problem already.

Not sure I follow. Is there another thread where somebody worked out why
this 62meg limit was happening?

Note that I sent v2/v3, to fix the original problem in a different way,
and if that looks good to the QEMU maintainers, then we can all be happy
with that. But I *haven't* addressed and still don't fully understand
why the 62meg limit applied to my v1 in the way it does. Did you find a
bug there to fix? If so, please do CC me.

Jason
H. Peter Anvin Dec. 31, 2022, 1:06 a.m. UTC | #21
On 12/30/22 14:10, Jason A. Donenfeld wrote:
> On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
>> See the other thread fork. They have identified the problem already.
> 
> Not sure I follow. Is there another thread where somebody worked out why
> this 62meg limit was happening?
> 
> Note that I sent v2/v3, to fix the original problem in a different way,
> and if that looks good to the QEMU maintainers, then we can all be happy
> with that. But I *haven't* addressed and still don't fully understand
> why the 62meg limit applied to my v1 in the way it does. Did you find a
> bug there to fix? If so, please do CC me.
> 

Yes, you yourself posted the problem:

> Then build qemu. Run it with `-kernel bzImage`, based on the kernel
> built with the .config I attached.
> 
> You'll see that the CPU triple faults when hitting this line:
> 
>         sd = (struct setup_data *)boot_params->hdr.setup_data;
>         while (sd) {
>                 unsigned long sd_addr = (unsigned long)sd;
> 
>                 kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);  <----
>                 sd = (struct setup_data *)sd->next;
>         }
> 
> , because it dereferences *sd. This does not happen if the decompressed
> size of the kernel is < 62 megs.
> 
> So that's the "big and pretty serious" bug that might be worthy of
> investigation.

This needs to be something like:

kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
kernel_add_identity_map(sd_addr + sizeof(*sd),
	sd_addr + sizeof(*sd) + sd->len);


TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to 
very, very old kernels that used INT 15h, AH=88h to probe memory.

	-hpa
H. Peter Anvin Dec. 31, 2022, 1:14 a.m. UTC | #22
On 12/30/22 17:06, H. Peter Anvin wrote
> 
> TThe 62 MB limit mentioned in boot.rst is unrelated, and only applies to 
> very, very old kernels that used INT 15h, AH=88h to probe memory.
> 

I am 88% sure this was fixed long before setup_data was created, as it 
was created originally to carry e820 info for more than 128(!) memory 
segments. However, as we see here, it is never certain that bugs didn't 
creep in in the meantime...

	-hpa
Borislav Petkov Dec. 31, 2022, 9:48 a.m. UTC | #23
On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote:
> I'll attach a .config file. Apply the patch at the top of this thread to
> qemu,

Hmm, so the patch at the top of this thread is fixing the clobbering of
setup_data.

But I tried latest qemu with your .config and it boots ok here. So how do I
repro the original issue you're trying to fix?

Thx.
Jason A. Donenfeld Dec. 31, 2022, 12:54 p.m. UTC | #24
On Sat, Dec 31, 2022 at 10:48:16AM +0100, Borislav Petkov wrote:
> On Fri, Dec 30, 2022 at 04:59:30PM +0100, Jason A. Donenfeld wrote:
> > I'll attach a .config file. Apply the patch at the top of this thread to
> > qemu,
> 
> Hmm, so the patch at the top of this thread is fixing the clobbering of
> setup_data.
> 
> But I tried latest qemu with your .config and it boots ok here. So how do I
> repro the original issue you're trying to fix?

Nothing special... `-kernel bzImage` should be enough to do it. Eric
reported it, and then I was able to repro trivially. Sure you got the
right version?

Jason
Jason A. Donenfeld Dec. 31, 2022, 12:55 p.m. UTC | #25
On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> 
> 
> On 12/30/22 14:10, Jason A. Donenfeld wrote:
> > On Fri, Dec 30, 2022 at 01:58:39PM -0800, H. Peter Anvin wrote:
> >> See the other thread fork. They have identified the problem already.
> > 
> > Not sure I follow. Is there another thread where somebody worked out why
> > this 62meg limit was happening?
> > 
> > Note that I sent v2/v3, to fix the original problem in a different way,
> > and if that looks good to the QEMU maintainers, then we can all be happy
> > with that. But I *haven't* addressed and still don't fully understand
> > why the 62meg limit applied to my v1 in the way it does. Did you find a
> > bug there to fix? If so, please do CC me.
> > 
> 
> Yes, you yourself posted the problem:
> 
> > Then build qemu. Run it with `-kernel bzImage`, based on the kernel
> > built with the .config I attached.
> > 
> > You'll see that the CPU triple faults when hitting this line:
> > 
> >         sd = (struct setup_data *)boot_params->hdr.setup_data;
> >         while (sd) {
> >                 unsigned long sd_addr = (unsigned long)sd;
> > 
> >                 kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd) + sd->len);  <----
> >                 sd = (struct setup_data *)sd->next;
> >         }
> > 
> > , because it dereferences *sd. This does not happen if the decompressed
> > size of the kernel is < 62 megs.
> > 
> > So that's the "big and pretty serious" bug that might be worthy of
> > investigation.
> 
> This needs to be something like:
> 
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> kernel_add_identity_map(sd_addr + sizeof(*sd),
> 	sd_addr + sizeof(*sd) + sd->len);
> 

Oh, right, duh. Thanks for spelling it out.

Jason
Borislav Petkov Dec. 31, 2022, 1:35 p.m. UTC | #26
On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote:
> Nothing special... `-kernel bzImage` should be enough to do it. Eric
> reported it, and then I was able to repro trivially. Sure you got the
> right version?

Yeah, qemu executables confusion here - had wrongly something older of the
version 7.1...

Now made sure I'm actually booting with the latest qemu:

QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf)

With that the kernel with your config hangs early during boot and the stack
trace is below.

Seeing how it says trapnr 14, then that looks like something you are seeing.

But lemme poke at it more.

#0  0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1  halt () at ./arch/x86/include/asm/irqflags.h:98
#2  early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3  0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
#4  0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
#5  0x2404c74100000cd0 in ?? ()
#6  0xffffffffff20073c in ?? ()
#7  0x0000000000000010 in fixed_percpu_data ()
#8  0xdffffc0000000000 in ?? ()
#9  0xffffffff84007ea8 in init_thread_union ()
#10 0xffffffffff200cd0 in ?? ()
#11 0x0000000000000000 in ?? ()
Borislav Petkov Dec. 31, 2022, 1:40 p.m. UTC | #27
On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> This needs to be something like:
> 
> kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> kernel_add_identity_map(sd_addr + sizeof(*sd),
> 	sd_addr + sizeof(*sd) + sd->len);

It still #PFs with that:

(gdb) bt
#0  0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
#1  halt () at ./arch/x86/include/asm/irqflags.h:98
#2  early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
#3  0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
#4  0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
#5  0xc149f9894908788d in ?? ()
#6  0xffffffffff2003fc in ?? ()
#7  0x0000000000000010 in fixed_percpu_data ()
#8  0xdffffc0000000000 in ?? ()
#9  0xffffffff84007ea8 in init_thread_union ()
#10 0xffffffffff20088d in ?? ()
#11 0x0000000000000000 in ?? ()

/me goes to dig more.
Jason A. Donenfeld Dec. 31, 2022, 1:42 p.m. UTC | #28
On Sat, Dec 31, 2022 at 02:35:45PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 01:54:50PM +0100, Jason A. Donenfeld wrote:
> > Nothing special... `-kernel bzImage` should be enough to do it. Eric
> > reported it, and then I was able to repro trivially. Sure you got the
> > right version?
> 
> Yeah, qemu executables confusion here - had wrongly something older of the
> version 7.1...
> 
> Now made sure I'm actually booting with the latest qemu:
> 
> QEMU emulator version 7.2.50 (v7.2.0-333-g222059a0fccf)
> 
> With that the kernel with your config hangs early during boot and the stack
> trace is below.
> 
> Seeing how it says trapnr 14, then that looks like something you are seeing.
> 
> But lemme poke at it more.

Yes. The cause is what I've described in the commit message. There are
two proposed fixes, the v1, which has the 62 MiB limitation due to a
kernel bug, and the v3, which seems to work fine, and is simpler, and I
suspect that's the one QEMU upstream should take.

https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
Jason A. Donenfeld Dec. 31, 2022, 1:44 p.m. UTC | #29
On Sat, Dec 31, 2022 at 02:40:59PM +0100, Borislav Petkov wrote:
> On Fri, Dec 30, 2022 at 05:06:55PM -0800, H. Peter Anvin wrote:
> > This needs to be something like:
> > 
> > kernel_add_identity_map(sd_addr, sd_addr + sizeof(*sd));
> > kernel_add_identity_map(sd_addr + sizeof(*sd),
> > 	sd_addr + sizeof(*sd) + sd->len);
> 
> It still #PFs with that:
> 
> (gdb) bt
> #0  0xffffffff84738576 in native_halt () at ./arch/x86/include/asm/irqflags.h:57
> #1  halt () at ./arch/x86/include/asm/irqflags.h:98
> #2  early_fixup_exception (regs=regs@entry=0xffffffff84007dc8, trapnr=trapnr@entry=14) at arch/x86/mm/extable.c:340
> #3  0xffffffff846ff465 in do_early_exception (regs=0xffffffff84007dc8, trapnr=14) at arch/x86/kernel/head64.c:424
> #4  0xffffffff846ff14f in early_idt_handler_common () at arch/x86/kernel/head_64.S:483
> #5  0xc149f9894908788d in ?? ()
> #6  0xffffffffff2003fc in ?? ()
> #7  0x0000000000000010 in fixed_percpu_data ()
> #8  0xdffffc0000000000 in ?? ()
> #9  0xffffffff84007ea8 in init_thread_union ()
> #10 0xffffffffff20088d in ?? ()
> #11 0x0000000000000000 in ?? ()
> 
> /me goes to dig more.

Are you using patch v1 minus the 62 MiB thing? If you haven't applied
patch v1 and then removed the 62 MiB limitation in it, then you've
misunderstood the conversation again.

Please see my reproduction steps to Peter:
https://lore.kernel.org/lkml/Y68K4mPuz6edQkCX@zx2c4.com/

Jason
Borislav Petkov Dec. 31, 2022, 1:48 p.m. UTC | #30
On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> Are you using patch v1 minus the 62 MiB thing?

No, I want to see the original failure - the one that prompted you to send

https://lore.kernel.org/r/20221228143831.396245-1-Jason@zx2c4.com
Jason A. Donenfeld Dec. 31, 2022, 1:51 p.m. UTC | #31
On Sat, Dec 31, 2022 at 02:48:12PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 02:44:08PM +0100, Jason A. Donenfeld wrote:
> > Are you using patch v1 minus the 62 MiB thing?
> 
> No, I want to see the original failure - the one that prompted you to send
> 
> https://lore.kernel.org/r/20221228143831.396245-1-Jason@zx2c4.com

That failure is unrelated to the ident mapping issue Peter and
I discussed. The original failure is described in the commit message:
decompression clobbers the data, so sd->next points to garbage.
Borislav Petkov Dec. 31, 2022, 2:24 p.m. UTC | #32
On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
> That failure is unrelated to the ident mapping issue Peter and
> I discussed. The original failure is described in the commit message:
> decompression clobbers the data, so sd->next points to garbage.

Right, and the fact that the kernel overwrites it still feels kinda wrong: the
kernel knows where setup_data is - the address is in the setup header so
*actually*, it should take care of not to clobber it.

But hpa would correct me if I'm missing an aspect about whose responsibility it
is to do what here...

Thx.
Jason A. Donenfeld Dec. 31, 2022, 6:22 p.m. UTC | #33
On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
> > That failure is unrelated to the ident mapping issue Peter and
> > I discussed. The original failure is described in the commit message:
> > decompression clobbers the data, so sd->next points to garbage.
> 
> Right

So with that understanding confirmed, I'm confused at your surprise that
hpa's unrelated fix to the different issue didn't fix this issue.

> and the fact that the kernel overwrites it still feels kinda wrong: the
> kernel knows where setup_data is - the address is in the setup header so
> *actually*, it should take care of not to clobber it.

Yea, technically the bootloader could relocate all the setup_data links
by copying them and updating ->next. This wouldn't be so hard to do.
(Special care would have to be taken, though, to zero out
SETUP_RNG_SEED, though, for forward secrecy and such.)

But since the kernel doesn't do this now, and the 62MiB bug also seems
to apply to existing kernels, for the purposes of QEMU for now, I think
the v3 patch is probably best, since it'll handle existing kernels.
Alternatively, setup_data could be relocated, the boot param protocol
could be bumped, and then QEMU could conditionalized it's use of
setup_data based on that protocol version. That'd work, but seems a bit
more involved.

So maybe for now, v3 works? Hopefully that looks like a correct approach
to hpa, anyhow:
https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/
I think it should fit with what he described would work.

Jason
Borislav Petkov Dec. 31, 2022, 7 p.m. UTC | #34
On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote:
> So with that understanding confirmed, I'm confused at your surprise that
> hpa's unrelated fix to the different issue didn't fix this issue.

No surprise there - I used a qemu variant without your patch to prevent the
setup_data clobbering so hpa's fix can't help there.

> But since the kernel doesn't do this now, and the 62MiB bug also seems
> to apply to existing kernels, for the purposes of QEMU for now, I think
> the v3 patch is probably best, since it'll handle existing kernels.

Right, we can't fix all those guests which are out there.

> Alternatively, setup_data could be relocated, the boot param protocol
> could be bumped, and then QEMU could conditionalized it's use of
> setup_data based on that protocol version. That'd work, but seems a bit
> more involved.

I think this is at least worth considering because the kernel overwriting
setup_data after having been told where that setup_data is located is not really
nice.

Well not explicitly at least - it gets the pointer to the first element and
something needs to traverse that list to know which addresses are live. But
still, that info is there so perhaps we need to take setup_data into
consideration too before decompressing...
H. Peter Anvin Jan. 1, 2023, 3:21 a.m. UTC | #35
On 12/31/22 11:00, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 07:22:47PM +0100, Jason A. Donenfeld wrote:
>> So with that understanding confirmed, I'm confused at your surprise that
>> hpa's unrelated fix to the different issue didn't fix this issue.
> 
> No surprise there - I used a qemu variant without your patch to prevent the
> setup_data clobbering so hpa's fix can't help there.
> 
>> But since the kernel doesn't do this now, and the 62MiB bug also seems
>> to apply to existing kernels, for the purposes of QEMU for now, I think
>> the v3 patch is probably best, since it'll handle existing kernels.
> 
> Right, we can't fix all those guests which are out there.
> 
>> Alternatively, setup_data could be relocated, the boot param protocol
>> could be bumped, and then QEMU could conditionalized it's use of
>> setup_data based on that protocol version. That'd work, but seems a bit
>> more involved.
> 
> I think this is at least worth considering because the kernel overwriting
> setup_data after having been told where that setup_data is located is not really
> nice.
> 
> Well not explicitly at least - it gets the pointer to the first element and
> something needs to traverse that list to know which addresses are live. But
> still, that info is there so perhaps we need to take setup_data into
> consideration too before decompressing...
> 

As far as the decompression itself goes, it should only a problem if we 
are using physical KASLR since otherwise the kernel has a guaranteed 
safe zone already allocated by the boot loader. However, if physical 
KASLR is in use, then the decompressor needs to know everything there is 
to know about the memory map.

However, there also seems to be some kind of interaction with AMD SEV-SNP.


The bug appears to have been introduced by:

b57feed2cc2622ae14b2fa62f19e973e5e0a60cf
x86/compressed/64: Add identity mappings for setup_data entries
https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com

... which was included in version 5.19, so it is relatively recent.

For a small amount of setup_data, the solution of just putting it next 
to the command line makes a lot of sense, and should be safe indefinitely.

	-hpa
H. Peter Anvin Jan. 1, 2023, 3:31 a.m. UTC | #36
On 12/31/22 19:21, H. Peter Anvin wrote:
>>
>>> Alternatively, setup_data could be relocated, the boot param protocol
>>> could be bumped, and then QEMU could conditionalized it's use of
>>> setup_data based on that protocol version. That'd work, but seems a bit
>>> more involved.
>>
>> I think this is at least worth considering because the kernel overwriting
>> setup_data after having been told where that setup_data is located is 
>> not really
>> nice.
>>
>> Well not explicitly at least - it gets the pointer to the first 
>> element and
>> something needs to traverse that list to know which addresses are 
>> live. But
>> still, that info is there so perhaps we need to take setup_data into
>> consideration too before decompressing...
>>

It would probably be a good idea to add a "maximum physical address for 
initrd/setup_data/cmdline" field to struct kernel_info, though. It 
appears right now that those fields are being identity-mapped in the 
decompressor, and that means that if 48-bit addressing is used, physical 
memory may extend past the addressable range.

	-hpa
H. Peter Anvin Jan. 1, 2023, 4:33 a.m. UTC | #37
On 12/31/22 10:22, Jason A. Donenfeld wrote:
> On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
>> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
>>> That failure is unrelated to the ident mapping issue Peter and
>>> I discussed. The original failure is described in the commit message:
>>> decompression clobbers the data, so sd->next points to garbage.
>>
>> Right
> 
> So with that understanding confirmed, I'm confused at your surprise that
> hpa's unrelated fix to the different issue didn't fix this issue.
> 

If decompression does clobber the data, then we *also* need to figure 
out why that is. There are basically three possibilities:

1. If physical KASLR is NOT used:

	a. The boot loader doesn't honor the kernel safe area properly;
	b. Somewhere in the process a bug in the calculation of the
	   kernel safe area has crept in.

2. If physical KASLR IS used:

	The decompressor doesn't correctly keep track of nor relocate
	all the keep-out zones before picking a target address.

One is a bootloader bug, two is a kernel bug. My guess is (2) is the 
culprit, but (1b) should be checked, too.

	-hpa
Mika Penttilä Jan. 1, 2023, 4:55 a.m. UTC | #38
On 1.1.2023 6.33, H. Peter Anvin wrote:
> 
> 
> On 12/31/22 10:22, Jason A. Donenfeld wrote:
>> On Sat, Dec 31, 2022 at 03:24:32PM +0100, Borislav Petkov wrote:
>>> On Sat, Dec 31, 2022 at 02:51:28PM +0100, Jason A. Donenfeld wrote:
>>>> That failure is unrelated to the ident mapping issue Peter and
>>>> I discussed. The original failure is described in the commit message:
>>>> decompression clobbers the data, so sd->next points to garbage.
>>>
>>> Right
>>
>> So with that understanding confirmed, I'm confused at your surprise that
>> hpa's unrelated fix to the different issue didn't fix this issue.
>>
> 
> If decompression does clobber the data, then we *also* need to figure 
> out why that is. There are basically three possibilities:
> 
> 1. If physical KASLR is NOT used:
> 
>      a. The boot loader doesn't honor the kernel safe area properly;
>      b. Somewhere in the process a bug in the calculation of the
>         kernel safe area has crept in.
> 
> 2. If physical KASLR IS used:
> 
>      The decompressor doesn't correctly keep track of nor relocate
>      all the keep-out zones before picking a target address.

Seems setup_data is not included in those mem_avoid regions.

> 
> One is a bootloader bug, two is a kernel bug. My guess is (2) is the 
> culprit, but (1b) should be checked, too.
> 
>      -hpa
> 


--Mika
H. Peter Anvin Jan. 1, 2023, 5:13 a.m. UTC | #39
On 12/31/22 20:55, Mika Penttilä wrote:
>>
>> If decompression does clobber the data, then we *also* need to figure 
>> out why that is. There are basically three possibilities:
>>
>> 1. If physical KASLR is NOT used:
>>
>>      a. The boot loader doesn't honor the kernel safe area properly;
>>      b. Somewhere in the process a bug in the calculation of the
>>         kernel safe area has crept in.
>>
>> 2. If physical KASLR IS used:
>>
>>      The decompressor doesn't correctly keep track of nor relocate
>>      all the keep-out zones before picking a target address.
> 
> Seems setup_data is not included in those mem_avoid regions.
> 

[facepalm]


>>
>> One is a bootloader bug, two is a kernel bugs. My guess is (2) is the 
>> culprit, but (1b) should be checked, too.
>>

Correction: two are kernel bugs, i.e. (1b) and (2) are both kernel bugs.

	-hpa
Borislav Petkov Jan. 2, 2023, 5:50 a.m. UTC | #40
On Sat, Dec 31, 2022 at 07:21:06PM -0800, H. Peter Anvin wrote:
> As far as the decompression itself goes, it should only a problem if we are
> using physical KASLR since otherwise the kernel has a guaranteed safe zone
> already allocated by the boot loader. However, if physical KASLR is in use,

No KASLR in Jason's config AFAICT:

$ grep RANDOMIZE .config
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set

> then the decompressor needs to know everything there is to know about the
> memory map.

Yeah, we do have that but as you folks establish later in the thread, those
setup_data regions would need to be avoided too. ;-\
 
> However, there also seems to be some kind of interaction with AMD SEV-SNP.
> 
> 
> The bug appears to have been introduced by:
> 
> b57feed2cc2622ae14b2fa62f19e973e5e0a60cf
> x86/compressed/64: Add identity mappings for setup_data entries
> https://lore.kernel.org/r/TYCPR01MB694815CD815E98945F63C99183B49@TYCPR01MB6948.jpnprd01.prod.outlook.com
> 
> ... which was included in version 5.19, so it is relatively recent.

Right. We need that for the CC blob:

b190a043c49a ("x86/sev: Add SEV-SNP feature detection/setup")

> For a small amount of setup_data, the solution of just putting it next to
> the command line makes a lot of sense, and should be safe indefinitely.

Ok.

Thx.
Borislav Petkov Jan. 2, 2023, 6:01 a.m. UTC | #41
On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> It would probably be a good idea to add a "maximum physical address for
> initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> right now that those fields are being identity-mapped in the decompressor,
> and that means that if 48-bit addressing is used, physical memory may extend
> past the addressable range.

Yeah, we will probably need that too.

Btw, looka here - it can't get any more obvious than that after dumping
setup_data too:

early console in setup code
early console in extract_kernel
input_data: 0x00000000040f92bf
input_len: 0x0000000000f1c325
output: 0x0000000001000000
output_len: 0x0000000003c5e7d8
kernel_total_size: 0x0000000004428000
needed_size: 0x0000000004600000
boot_params->hdr.setup_data: 0x00000000010203b0
trampoline_32bit: 0x000000000009d000

Decompressing Linux... Parsing ELF... done.
Booting the kernel.
<EOF>

Aligning them vertically:

output:				0x0000000001000000
output_len:			0x0000000003c5e7d8
kernel_total_size:		0x0000000004428000
needed_size:			0x0000000004600000
boot_params->hdr.setup_data:	0x00000000010203b0
Borislav Petkov Jan. 2, 2023, 6:17 a.m. UTC | #42
On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > It would probably be a good idea to add a "maximum physical address for
> > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > right now that those fields are being identity-mapped in the decompressor,
> > and that means that if 48-bit addressing is used, physical memory may extend
> > past the addressable range.
> 
> Yeah, we will probably need that too.
> 
> Btw, looka here - it can't get any more obvious than that after dumping
> setup_data too:
> 
> early console in setup code
> early console in extract_kernel
> input_data: 0x00000000040f92bf
> input_len: 0x0000000000f1c325
> output: 0x0000000001000000
> output_len: 0x0000000003c5e7d8
> kernel_total_size: 0x0000000004428000
> needed_size: 0x0000000004600000
> boot_params->hdr.setup_data: 0x00000000010203b0
> trampoline_32bit: 0x000000000009d000
> 
> Decompressing Linux... Parsing ELF... done.
> Booting the kernel.
> <EOF>
> 
> Aligning them vertically:
> 
> output:				0x0000000001000000
> output_len:			0x0000000003c5e7d8
> kernel_total_size:		0x0000000004428000
> needed_size:			0x0000000004600000
> boot_params->hdr.setup_data:	0x00000000010203b0

Ok, waait a minute:

============    ============
Field name:     pref_address
Type:           read (reloc)
Offset/size:    0x258/8
Protocol:       2.10+
============    ============

  This field, if nonzero, represents a preferred load address for the
  kernel.  A relocating bootloader should attempt to load at this
  address if possible.

  A non-relocatable kernel will unconditionally move itself and to run
  at this address.

so a kernel loader (qemu in this case) already knows where the kernel goes:

boot_params->hdr.setup_data: 0x0000000001020450
boot_params->hdr.pref_address: 0x0000000001000000
				^^^^^^^^^^^^^^^^^

now, considering that same kernel loader (qemu) knows how big that kernel is:

kernel_total_size: 0x0000000004428000

should that loader *not* put anything that the kernel will use in the range

pref_addr + kernel_total_size

?
Ard Biesheuvel Jan. 2, 2023, 9:32 a.m. UTC | #43
On Mon, 2 Jan 2023 at 07:17, Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Jan 02, 2023 at 07:01:50AM +0100, Borislav Petkov wrote:
> > On Sat, Dec 31, 2022 at 07:31:21PM -0800, H. Peter Anvin wrote:
> > > It would probably be a good idea to add a "maximum physical address for
> > > initrd/setup_data/cmdline" field to struct kernel_info, though. It appears
> > > right now that those fields are being identity-mapped in the decompressor,
> > > and that means that if 48-bit addressing is used, physical memory may extend
> > > past the addressable range.
> >
> > Yeah, we will probably need that too.
> >
> > Btw, looka here - it can't get any more obvious than that after dumping
> > setup_data too:
> >
> > early console in setup code
> > early console in extract_kernel
> > input_data: 0x00000000040f92bf
> > input_len: 0x0000000000f1c325
> > output: 0x0000000001000000
> > output_len: 0x0000000003c5e7d8
> > kernel_total_size: 0x0000000004428000
> > needed_size: 0x0000000004600000
> > boot_params->hdr.setup_data: 0x00000000010203b0
> > trampoline_32bit: 0x000000000009d000
> >
> > Decompressing Linux... Parsing ELF... done.
> > Booting the kernel.
> > <EOF>
> >
> > Aligning them vertically:
> >
> > output:                               0x0000000001000000
> > output_len:                   0x0000000003c5e7d8
> > kernel_total_size:            0x0000000004428000
> > needed_size:                  0x0000000004600000
> > boot_params->hdr.setup_data:  0x00000000010203b0
>
> Ok, waait a minute:
>
> ============    ============
> Field name:     pref_address
> Type:           read (reloc)
> Offset/size:    0x258/8
> Protocol:       2.10+
> ============    ============
>
>   This field, if nonzero, represents a preferred load address for the
>   kernel.  A relocating bootloader should attempt to load at this
>   address if possible.
>
>   A non-relocatable kernel will unconditionally move itself and to run
>   at this address.
>
> so a kernel loader (qemu in this case) already knows where the kernel goes:
>
> boot_params->hdr.setup_data: 0x0000000001020450
> boot_params->hdr.pref_address: 0x0000000001000000
>                                 ^^^^^^^^^^^^^^^^^
>
> now, considering that same kernel loader (qemu) knows how big that kernel is:
>
> kernel_total_size: 0x0000000004428000
>
> should that loader *not* put anything that the kernel will use in the range
>
> pref_addr + kernel_total_size
>

This seems to be related to another issue that was discussed in the
context of this change, but affecting EFI boot not legacy BIOS boot
[0].

So, in a nutshell, we have the following pieces:
- QEMU, which manages a directory of files and other data blobs, and
exposes them via its fw_cfg interface.
- SeaBIOS, which invokes the fw_cfg interface to load the 'kernel'
blob at its preferred address
- The boot code in the kernel, which interprets the various fields in
the setup header to figure out where the compressed image lives etc

So the problem here, which applies to SETUP_DTB as well as
SETUP_RNG_SEED, is that the internal file representation of the kernel
blob (which does not have an absolute address at this point, it's just
a file in the fw_cfg filesystem) is augmented with:
1) setup_data linked-list entries carrying absolute addresses that are
assumed to be valid once SeaBIOS loads the file to memory
2) DTB and/or RNG seed blobs appended to the compressed 'kernel' blob,
but without updating that file's internal metadata

Issue 1) is what broke EFI boot, given that EFI interprets the kernel
blob as a PE/COFF image and hands it to the Loadimage() boot service,
which has no awareness of boot_params or setup_data and so just
ignores it and loads the image at an arbitrary address, resulting in
setup_data absolute address values pointing to bogus places.

It seems that now, we have another issue 2), where the fw_cfg view of
the file size goes out of sync with the compressed image's own view of
its size.

As a fix for issue 1), we explored another solution, which was to
allocate fixed areas in memory for the RNG seed, so that the absolute
address added to setup_data is guaranteed to be correct regardless of
where the compressed image is loaded, but that was shot down for other
reasons, and we ended up enabling this feature only for legacy BIOS
boot. But apparently, this approach has other issues so perhaps it is
better to revisit that solution again.

So instead of appending data to the compressed image and assuming that
it will stay in place, create or extend a memory reservation
elsewhere, and refer to its absolute address in setup_data.
Borislav Petkov Jan. 2, 2023, 1:36 p.m. UTC | #44
On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> So instead of appending data to the compressed image and assuming that
> it will stay in place, create or extend a memory reservation
> elsewhere, and refer to its absolute address in setup_data.

From my limited experience with all those boot protocols, I'd say hardcoding
stuff is always a bad idea. But, we already more or less hardcode, or rather
codify through the setup header contract how stuff needs to get accessed.

And yeah, maybe specifying an absolute address and size for a blob of data and
putting that address and size in the setup header so that all the parties
involved are where what is, is probably better.

But WTH do I know...
Ard Biesheuvel Jan. 2, 2023, 3:03 p.m. UTC | #45
On Mon, 2 Jan 2023 at 14:37, Borislav Petkov <bp@alien8.de> wrote:
>
> On Mon, Jan 02, 2023 at 10:32:03AM +0100, Ard Biesheuvel wrote:
> > So instead of appending data to the compressed image and assuming that
> > it will stay in place, create or extend a memory reservation
> > elsewhere, and refer to its absolute address in setup_data.
>
> From my limited experience with all those boot protocols, I'd say hardcoding
> stuff is always a bad idea. But, we already more or less hardcode, or rather
> codify through the setup header contract how stuff needs to get accessed.
>
> And yeah, maybe specifying an absolute address and size for a blob of data and
> putting that address and size in the setup header so that all the parties
> involved are where what is, is probably better.
>

Exactly. In the EFI case, this was problematic because we would need
to introduce a new way to pass memory reservations between QEMU and
the firmware. But I don't think that issue should affect legacy BIOS
boot, and we could just reserve the memory in the E820 table AFAIK.
diff mbox series

Patch

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 78cc131926..628fd2b2e9 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1077,6 +1077,36 @@  void x86_load_linux(X86MachineState *x86ms,
     }
     fclose(f);
 
+    /* If a setup_data is going to be used, make sure that the bootloader won't
+     * decompress into it and clobber those bytes. */
+    if (dtb_filename || !legacy_no_rng_seed) {
+        uint32_t payload_offset = ldl_p(setup + 0x248);
+        uint32_t payload_length = ldl_p(setup + 0x24c);
+        uint32_t target_address = ldl_p(setup + 0x258);
+        uint32_t decompressed_length = ldl_p(kernel + payload_offset + payload_length - 4);
+
+        uint32_t estimated_setup_data_length = 4096 * 16;
+        uint32_t start_setup_data = prot_addr + kernel_size;
+        uint32_t end_setup_data = start_setup_data + estimated_setup_data_length;
+        uint32_t start_target = target_address;
+        uint32_t end_target = target_address + decompressed_length;
+
+        if ((start_setup_data >= start_target && start_setup_data < end_target) ||
+            (end_setup_data >= start_target && end_setup_data < end_target)) {
+            uint32_t padded_size = target_address + decompressed_length - prot_addr;
+
+            /* The early stage can't address past around 64 MB from the original
+             * mapping, so just give up in that case. */
+            if (padded_size < 62 * 1024 * 1024)
+                kernel_size = padded_size;
+            else {
+                fprintf(stderr, "qemu: Kernel image too large to hold setup_data\n");
+                dtb_filename = NULL;
+                legacy_no_rng_seed = true;
+            }
+        }
+    }
+
     /* append dtb to kernel */
     if (dtb_filename) {
         if (protocol < 0x209) {