Message ID | 20221230220725.618763-1-Jason@zx2c4.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [qemu,v3] x86: don't let decompressed kernel image clobber setup_data | expand |
On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the compressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Visually, what happens now is that QEMU appends setup_data to the kernel > image: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > The problem is that this decompresses to 0x1000000 (one more zero). So > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > > The decompressed kernel seemingly overwriting the compressed kernel > image isn't a problem, because that gets relocated to a higher address > early on in the boot process, at the end of startup_64. setup_data, > however, stays in the same place, since those links are self referential > and nothing fixes them up. So the decompressed kernel clobbers it. > > Fix this by appending setup_data to the cmdline blob rather than the > kernel image blob, which remains at a lower address that won't get > clobbered. > > This could have been done by overwriting the initrd blob instead, but > that poses big difficulties, such as no longer being able to use memory > mapped files for initrd, hurting performance, and, more importantly, the > initrd address calculation is hard coded in qboot, and it always grows > down rather than up, which means lots of brittle semantics would have to > be changed around, incurring more complexity. In contrast, using cmdline > is simple and doesn't interfere with anything. > > The microvm machine has a gross hack where it fiddles with fw_cfg data > after the fact. So this hack is updated to account for this appending, > by reserving some bytes. > > Cc: x86@kernel.org > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Eric Biggers <ebiggers@kernel.org> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> For what it's worth: Tested-by: Eric Biggers <ebiggers@google.com> - Eric
Hi Jason! Am 30.12.22 um 23:07 schrieb Jason A. Donenfeld: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the compressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Visually, what happens now is that QEMU appends setup_data to the kernel > image: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > The problem is that this decompresses to 0x1000000 (one more zero). So > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > > The decompressed kernel seemingly overwriting the compressed kernel > image isn't a problem, because that gets relocated to a higher address > early on in the boot process, at the end of startup_64. setup_data, > however, stays in the same place, since those links are self referential > and nothing fixes them up. So the decompressed kernel clobbers it. I just ran into this very issue yesterday when trying to boot a 6.1 kernel. pipacs pointed me to some changes of yours[1] which confirmed, the issue is related to the additional setup_data entries, as adding, e.g., '-M pc-i440fx-7.0' to the QEMU command line made the bug vanish (as QEMU then omits adding the random seed setup_data entries) . [1] https://github.com/qemu/qemu/commit/67f7e426e538 After digging a while I found this thread and it fixes the issue for me, thereby: Tested-by: Mathias Krause <minipli@grsecurity.net> Thanks, Mathias > [snip]
Hi Michael, Could you queue up this patch and mark it as a fix for 7.2.1? It is a straight-up bug fix for a 7.2 regression that's now affected several users. - It has two Tested-by tags on the thread. - hpa, the maintainer of the kernel side of this, confirmed on one of the various tributary threads that this approach is a correct one. - It doesn't introduce any new functionality. For your convenience, you can grab this out of lore here: https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ Or if you want to yolo it: curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s It's now sat silent on the mailing list for a while. So let's please get this committed and backported so that the bug reports stop coming in. Thanks, Jason
On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > Hi Michael, > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > straight-up bug fix for a 7.2 regression that's now affected several > users. OK. In the future pls cc me if you want me to merge a patch. Thanks! > - It has two Tested-by tags on the thread. > - hpa, the maintainer of the kernel side of this, confirmed on one of > the various tributary threads that this approach is a correct one. > - It doesn't introduce any new functionality. > > For your convenience, you can grab this out of lore here: > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > Or if you want to yolo it: > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > It's now sat silent on the mailing list for a while. So let's please get > this committed and backported so that the bug reports stop coming in. > > Thanks, > Jason > >
Hi Michael, On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote: > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > > Hi Michael, > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > > straight-up bug fix for a 7.2 regression that's now affected several > > users. > > OK. In the future pls cc me if you want me to merge a patch. Thanks! > > > - It has two Tested-by tags on the thread. > > - hpa, the maintainer of the kernel side of this, confirmed on one of > > the various tributary threads that this approach is a correct one. > > - It doesn't introduce any new functionality. > > > > For your convenience, you can grab this out of lore here: > > > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > > > Or if you want to yolo it: > > > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > > > It's now sat silent on the mailing list for a while. So let's please get > > this committed and backported so that the bug reports stop coming in. > > This patch still isn't on QEMU's master branch. What happened to it? - Eric
On 30/12/22 23:07, Jason A. Donenfeld wrote: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the compressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Visually, what happens now is that QEMU appends setup_data to the kernel > image: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > The problem is that this decompresses to 0x1000000 (one more zero). So > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > > The decompressed kernel seemingly overwriting the compressed kernel > image isn't a problem, because that gets relocated to a higher address > early on in the boot process, at the end of startup_64. setup_data, > however, stays in the same place, since those links are self referential > and nothing fixes them up. So the decompressed kernel clobbers it. > > Fix this by appending setup_data to the cmdline blob rather than the > kernel image blob, which remains at a lower address that won't get > clobbered. > > This could have been done by overwriting the initrd blob instead, but > that poses big difficulties, such as no longer being able to use memory > mapped files for initrd, hurting performance, and, more importantly, the > initrd address calculation is hard coded in qboot, and it always grows > down rather than up, which means lots of brittle semantics would have to > be changed around, incurring more complexity. In contrast, using cmdline > is simple and doesn't interfere with anything. > > The microvm machine has a gross hack where it fiddles with fw_cfg data > after the fact. So this hack is updated to account for this appending, > by reserving some bytes. > > Cc: x86@kernel.org > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Eric Biggers <ebiggers@kernel.org> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > --- > Changes v2->v3: > - Fix mistakes in string handling. > Changes v1->v2: > - Append setup_data to cmdline instead of kernel image. > > hw/i386/microvm.c | 13 ++++++---- > hw/i386/x86.c | 50 +++++++++++++++++++-------------------- > hw/nvram/fw_cfg.c | 9 +++++++ > include/hw/i386/microvm.h | 5 ++-- > include/hw/nvram/fw_cfg.h | 9 +++++++ > 5 files changed, 54 insertions(+), 32 deletions(-) > diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c > index a00881bc64..432754eda4 100644 > --- a/hw/nvram/fw_cfg.c > +++ b/hw/nvram/fw_cfg.c > @@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len) > fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true); > } > > +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key) > +{ > + int arch = !!(key & FW_CFG_ARCH_LOCAL); > + > + key &= FW_CFG_ENTRY_MASK; > + assert(key < fw_cfg_max_entry(s)); > + return s->entries[arch][key].data; Shouldn't it be safer to provide a size argument, and return NULL if s->entries[arch][key].len < size? Maybe this API should return a (casted) const pointer, so the only way to update the key is via fw_cfg_add_bytes(). > +} > + > diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h > index 2e503904dc..990dcdbb2e 100644 > --- a/include/hw/nvram/fw_cfg.h > +++ b/include/hw/nvram/fw_cfg.h > @@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key, > void *data, size_t len, > bool read_only); > > +/** > + * fw_cfg_read_bytes_ptr: > + * @s: fw_cfg device being modified > + * @key: selector key value for new fw_cfg item > + * > + * Reads an existing fw_cfg data pointer. > + */ > +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key); > + > /** > * fw_cfg_add_string: > * @s: fw_cfg device being modified
On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote: > Hi Michael, > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote: > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > > > Hi Michael, > > > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > > > straight-up bug fix for a 7.2 regression that's now affected several > > > users. > > > > OK. In the future pls cc me if you want me to merge a patch. Thanks! > > > > > - It has two Tested-by tags on the thread. > > > - hpa, the maintainer of the kernel side of this, confirmed on one of > > > the various tributary threads that this approach is a correct one. > > > - It doesn't introduce any new functionality. > > > > > > For your convenience, you can grab this out of lore here: > > > > > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > > > > > Or if you want to yolo it: > > > > > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > > > > > It's now sat silent on the mailing list for a while. So let's please get > > > this committed and backported so that the bug reports stop coming in. > > > > > This patch still isn't on QEMU's master branch. What happened to it? > > - Eric Indeed though I remember picking it up. Tagged again now. Thanks!
On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote: > > Hi Michael, > > > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote: > > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > > > > Hi Michael, > > > > > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > > > > straight-up bug fix for a 7.2 regression that's now affected several > > > > users. > > > > > > OK. In the future pls cc me if you want me to merge a patch. Thanks! > > > > > > > - It has two Tested-by tags on the thread. > > > > - hpa, the maintainer of the kernel side of this, confirmed on one of > > > > the various tributary threads that this approach is a correct one. > > > > - It doesn't introduce any new functionality. > > > > > > > > For your convenience, you can grab this out of lore here: > > > > > > > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > > > > > > > Or if you want to yolo it: > > > > > > > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > > > > > > > It's now sat silent on the mailing list for a while. So let's please get > > > > this committed and backported so that the bug reports stop coming in. > > > > > > > > This patch still isn't on QEMU's master branch. What happened to it? > > > > - Eric > > Indeed though I remember picking it up. Tagged again now. Thanks! Thanks. What branch is this in? I didn't see it on: https://gitlab.com/mstredhat/qemu/-/branches/active https://github.com/mstsirkin/qemu/branches
On Mon, Jan 23, 2023 at 06:37:21AM -0600, Jason A. Donenfeld wrote: > On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote: > > > Hi Michael, > > > > > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote: > > > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > > > > > Hi Michael, > > > > > > > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > > > > > straight-up bug fix for a 7.2 regression that's now affected several > > > > > users. > > > > > > > > OK. In the future pls cc me if you want me to merge a patch. Thanks! > > > > > > > > > - It has two Tested-by tags on the thread. > > > > > - hpa, the maintainer of the kernel side of this, confirmed on one of > > > > > the various tributary threads that this approach is a correct one. > > > > > - It doesn't introduce any new functionality. > > > > > > > > > > For your convenience, you can grab this out of lore here: > > > > > > > > > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > > > > > > > > > Or if you want to yolo it: > > > > > > > > > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > > > > > > > > > It's now sat silent on the mailing list for a while. So let's please get > > > > > this committed and backported so that the bug reports stop coming in. > > > > > > > > > > > This patch still isn't on QEMU's master branch. What happened to it? > > > > > > - Eric > > > > Indeed though I remember picking it up. Tagged again now. Thanks! > > Thanks. What branch is this in? I didn't see it on: > https://gitlab.com/mstredhat/qemu/-/branches/active > https://github.com/mstsirkin/qemu/branches I don't use github really. And it was not pushed to gitlab as I was figuring out issues with other patches before starting CI as CI minutes are limited. BTW as checkpatch was unhappy I applied a fixup - making checkpatch happier and in the process the code change a bit smaller. If you want to do cleanups on top be my guest but pls make it pass checkpatch. Thanks! commit a00d99e04c4481fca3ee2d7c40d42993b7b059c2 Author: Michael S. Tsirkin <mst@redhat.com> Date: Sat Jan 28 06:08:43 2023 -0500 fixup! x86: don't let decompressed kernel image clobber setup_data diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c index 1b19d28c02..29f30dd6d3 100644 --- a/hw/i386/microvm.c +++ b/hw/i386/microvm.c @@ -378,7 +378,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) MicrovmMachineState *mms = MICROVM_MACHINE(machine); BusState *bus; BusChild *kid; - char *cmdline, *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA); + char *cmdline, *existing_cmdline; size_t len; /* @@ -388,6 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) * Yes, this is a hack, but one that heavily improves the UX without * introducing any significant issues. */ + existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA); cmdline = g_strdup(existing_cmdline); bus = sysbus_get_default(); QTAILQ_FOREACH(kid, &bus->children, sibling) { @@ -413,10 +414,11 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) } len = strlen(cmdline); - if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) + if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) { fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n"); - else + } else { memcpy(existing_cmdline, cmdline, len + 1); + } g_free(cmdline); } diff --git a/hw/i386/x86.c b/hw/i386/x86.c index b57a993596..eaff4227bd 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -803,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms, bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled; uint16_t protocol; int setup_size, kernel_size, cmdline_size; - int dtb_size; + int dtb_size, setup_data_offset; uint32_t initrd_max; uint8_t header[8192], *setup, *kernel; hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0; @@ -818,8 +818,10 @@ void x86_load_linux(X86MachineState *x86ms, SevKernelLoaderContext sev_load_ctx = {}; enum { RNG_SEED_LENGTH = 32 }; - /* Add the NUL terminator, some padding for the microvm cmdline fiddling - * hack, and then align to 16 bytes as a paranoia measure */ + /* + * Add the NUL terminator, some padding for the microvm cmdline fiddling + * hack, and then align to 16 bytes as a paranoia measure + */ cmdline_size = (strlen(machine->kernel_cmdline) + 1 + VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15; /* Make a copy, since we might append arbitrary bytes to it later. */ @@ -1090,22 +1092,24 @@ void x86_load_linux(X86MachineState *x86ms, exit(1); } + setup_data_offset = cmdline_size; cmdline_size += sizeof(SetupData) + dtb_size; kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size); - setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size); + setup_data = (void *)kernel_cmdline + setup_data_offset; setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline); + first_setup_data = cmdline_addr + setup_data_offset; setup_data->type = cpu_to_le32(SETUP_DTB); setup_data->len = cpu_to_le32(dtb_size); load_image_size(dtb_filename, setup_data->data, dtb_size); } if (!legacy_no_rng_seed && protocol >= 0x209) { + setup_data_offset = cmdline_size; cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH; kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size); - setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH); + setup_data = (void *)kernel_cmdline + setup_data_offset; setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline); + first_setup_data = cmdline_addr + setup_data_offset; setup_data->type = cpu_to_le32(SETUP_RNG_SEED); setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
On Sat, Jan 28, 2023 at 06:15:03AM -0500, Michael S. Tsirkin wrote: > On Mon, Jan 23, 2023 at 06:37:21AM -0600, Jason A. Donenfeld wrote: > > On Mon, Jan 23, 2023 at 6:12 AM Michael S. Tsirkin <mst@redhat.com> wrote: > > > > > > On Sun, Jan 22, 2023 at 08:21:30PM -0800, Eric Biggers wrote: > > > > Hi Michael, > > > > > > > > On Tue, Jan 10, 2023 at 12:50:42PM -0500, Michael S. Tsirkin wrote: > > > > > On Tue, Jan 10, 2023 at 04:34:49PM +0100, Jason A. Donenfeld wrote: > > > > > > Hi Michael, > > > > > > > > > > > > Could you queue up this patch and mark it as a fix for 7.2.1? It is a > > > > > > straight-up bug fix for a 7.2 regression that's now affected several > > > > > > users. > > > > > > > > > > OK. In the future pls cc me if you want me to merge a patch. Thanks! > > > > > > > > > > > - It has two Tested-by tags on the thread. > > > > > > - hpa, the maintainer of the kernel side of this, confirmed on one of > > > > > > the various tributary threads that this approach is a correct one. > > > > > > - It doesn't introduce any new functionality. > > > > > > > > > > > > For your convenience, you can grab this out of lore here: > > > > > > > > > > > > https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/ > > > > > > > > > > > > Or if you want to yolo it: > > > > > > > > > > > > curl https://lore.kernel.org/lkml/20221230220725.618763-1-Jason@zx2c4.com/raw | git am -s > > > > > > > > > > > > It's now sat silent on the mailing list for a while. So let's please get > > > > > > this committed and backported so that the bug reports stop coming in. > > > > > > > > > > > > > > This patch still isn't on QEMU's master branch. What happened to it? > > > > > > > > - Eric > > > > > > Indeed though I remember picking it up. Tagged again now. Thanks! > > > > Thanks. What branch is this in? I didn't see it on: > > https://gitlab.com/mstredhat/qemu/-/branches/active > > https://github.com/mstsirkin/qemu/branches > > I don't use github really. And it was not pushed to gitlab as I was > figuring out issues with other patches before starting CI as CI minutes > are limited. QEMU CI config as of a few months ago, pushing to gitlab will *not* start CI pipelines. You need to explicitly use opt-in to it when pushing by using 'git push -o ci.variable=QEMU_CI=1' to create a pipeline with all jobs manual or QEMU_CI=2 to create a pipeline and immediately run all jobs. Alternatively use the web UI to start the pipeline, again setting QEMU_CI=1|2 So don't let CI minutes concerns dissuade from pushing to gitlab merely to publish the code. With regards, Daniel
Hi Jason, On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > The setup_data links are appended to the compressed kernel image. Since > the kernel image is typically loaded at 0x100000, setup_data lives at > `0x100000 + compressed_size`, which does not get relocated during the > kernel's boot process. > > The kernel typically decompresses the image starting at address > 0x1000000 (note: there's one more zero there than the compressed image > above). This usually is fine for most kernels. > > However, if the compressed image is actually quite large, then > setup_data will live at a `0x100000 + compressed_size` that extends into > the decompressed zone at 0x1000000. In other words, if compressed_size > is larger than `0x1000000 - 0x100000`, then the decompression step will > clobber setup_data, resulting in crashes. > > Visually, what happens now is that QEMU appends setup_data to the kernel > image: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > The problem is that this decompresses to 0x1000000 (one more zero). So > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > kernel image setup_data > |--------------------------||----------------| > 0x100000 0x100000+l1 0x100000+l1+l2 > > d e c o m p r e s s e d k e r n e l > |-------------------------------------------------------------| > 0x1000000 0x1000000+l3 > > The decompressed kernel seemingly overwriting the compressed kernel > image isn't a problem, because that gets relocated to a higher address > early on in the boot process, at the end of startup_64. setup_data, > however, stays in the same place, since those links are self referential > and nothing fixes them up. So the decompressed kernel clobbers it. > > Fix this by appending setup_data to the cmdline blob rather than the > kernel image blob, which remains at a lower address that won't get > clobbered. > > This could have been done by overwriting the initrd blob instead, but > that poses big difficulties, such as no longer being able to use memory > mapped files for initrd, hurting performance, and, more importantly, the > initrd address calculation is hard coded in qboot, and it always grows > down rather than up, which means lots of brittle semantics would have to > be changed around, incurring more complexity. In contrast, using cmdline > is simple and doesn't interfere with anything. > > The microvm machine has a gross hack where it fiddles with fw_cfg data > after the fact. So this hack is updated to account for this appending, > by reserving some bytes. > > Cc: x86@kernel.org > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > Cc: H. Peter Anvin <hpa@zytor.com> > Cc: Borislav Petkov <bp@alien8.de> > Cc: Eric Biggers <ebiggers@kernel.org> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> I apologize if this has already been reported/fixed already (I did a brief search on lore.kernel.org) or if my terminology is not as precise as it could be, I am a little out of my element here :) After this change as commit eac7a7791b ("x86: don't let decompressed kernel image clobber setup_data") in QEMU master, I am no longer able to boot a kernel directly through OVMF using '-append' + '-initrd' + '-kernel'. Instead, systemd-boot tries to start the distribution's kernel, which fails with: Error registering initrd: Already started This can be reproduced with just a defconfig Linux kernel (I used 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for boot testing [1], and OVMF. Prior to this change, the kernel would start up but after, I am dumped into the UEFI shell (as there is no bootloader). The QEMU command I used was: $ qemu-system-x86_64 \ -kernel arch/x86_64/boot/bzImage \ -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \ -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \ -object rng-random,filename=/dev/urandom,id=rng0 \ -device virtio-rng-pci \ -cpu host \ -enable-kvm \ -smp 8 \ -display none \ -initrd .../boot-utils/images/x86_64/rootfs.cpio \ -m 512m \ -nodefaults \ -no-reboot \ -serial mon:stdio Not sure if the OVMF version will matter but just in case: $ pacman -Q edk2-ovmf edk2-ovmf 202211-3 I did see a few fixes on qemu-devel for this patch such as [2] but none I found fixed the issue for me. If there is any additional information I can provide or patches I can test, please let me know. [1]: https://github.com/ClangBuiltLinux/boot-utils/blob/fec03ef13519e26ac1f226e404e1620a142d0e40/images/x86_64/rootfs.cpio.zst [2]: https://lore.kernel.org/20230207224847.94429-1-Jason@zx2c4.com/ Cheers, Nathan # bad: [969d09c3a6186c0a4bc8a41db0c1aba1c76081fc] Merge tag 'pull-aspeed-20230207' of https://github.com/legoater/qemu into staging # good: [b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56] Update VERSION for v7.2.0 git bisect start '969d09c3a6186c0a4bc8a41db0c1aba1c76081fc' 'b67b00e6b4c7831a3f5bc684bc0df7a9bfd1bd56' # good: [38cb336fe9d47507da5177c3d349532d0af6885e] hw/arm/gumstix: Use the IEC binary prefix definitions git bisect good 38cb336fe9d47507da5177c3d349532d0af6885e # bad: [864a3fa439276148b6d7abcf2d43ee8dbe4c4062] monitor: Rename misc.c to hmp-target.c git bisect bad 864a3fa439276148b6d7abcf2d43ee8dbe4c4062 # good: [e471a8c9850f1af0c1bc5768ca28285348cdd6c5] target/riscv: Trap on writes to stimecmp from VS when hvictl.VTI=1 git bisect good e471a8c9850f1af0c1bc5768ca28285348cdd6c5 # bad: [c1a9ac9bdeea213162a76f7b9e2f876c89a50a94] tests: acpi: cleanup use_uefi argument usage git bisect bad c1a9ac9bdeea213162a76f7b9e2f876c89a50a94 # good: [dc575b5e0300a7a375b4e4501a17ada21e9a6d10] hw/i2c/bitbang_i2c: Change state calling bitbang_i2c_set_state() helper git bisect good dc575b5e0300a7a375b4e4501a17ada21e9a6d10 # good: [3b07a936d3bfe97b07ddffcfbb532985a88033dd] target/arm: Look up ARMCPRegInfo at runtime git bisect good 3b07a936d3bfe97b07ddffcfbb532985a88033dd # good: [d395b18dce82855d03d934e30a515caf5a10a885] hw/acpi/acpi_dev_interface: Remove unused parameter from AcpiDeviceIfClass::madt_cpu git bisect good d395b18dce82855d03d934e30a515caf5a10a885 # bad: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data git bisect bad eac7a7791bb6d719233deed750034042318ffd56 # good: [8a8c9c3a747f77e664fa2288735b45a9d750be75] hw/pci-host: Use register definitions from PCI standard git bisect good 8a8c9c3a747f77e664fa2288735b45a9d750be75 # good: [8a7c606016d283a1716290c657f6f45bc7c4d817] intel-iommu: Document iova_tree git bisect good 8a7c606016d283a1716290c657f6f45bc7c4d817 # first bad commit: [eac7a7791bb6d719233deed750034042318ffd56] x86: don't let decompressed kernel image clobber setup_data
Hi Nathan (and MST), On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote: > > Hi Jason, > > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > > The setup_data links are appended to the compressed kernel image. Since > > the kernel image is typically loaded at 0x100000, setup_data lives at > > `0x100000 + compressed_size`, which does not get relocated during the > > kernel's boot process. > > > > The kernel typically decompresses the image starting at address > > 0x1000000 (note: there's one more zero there than the compressed image > > above). This usually is fine for most kernels. > > > > However, if the compressed image is actually quite large, then > > setup_data will live at a `0x100000 + compressed_size` that extends into > > the decompressed zone at 0x1000000. In other words, if compressed_size > > is larger than `0x1000000 - 0x100000`, then the decompression step will > > clobber setup_data, resulting in crashes. > > > > Visually, what happens now is that QEMU appends setup_data to the kernel > > image: > > > > kernel image setup_data > > |--------------------------||----------------| > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > The problem is that this decompresses to 0x1000000 (one more zero). So > > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > > > kernel image setup_data > > |--------------------------||----------------| > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > d e c o m p r e s s e d k e r n e l > > |-------------------------------------------------------------| > > 0x1000000 0x1000000+l3 > > > > The decompressed kernel seemingly overwriting the compressed kernel > > image isn't a problem, because that gets relocated to a higher address > > early on in the boot process, at the end of startup_64. setup_data, > > however, stays in the same place, since those links are self referential > > and nothing fixes them up. So the decompressed kernel clobbers it. > > > > Fix this by appending setup_data to the cmdline blob rather than the > > kernel image blob, which remains at a lower address that won't get > > clobbered. > > > > This could have been done by overwriting the initrd blob instead, but > > that poses big difficulties, such as no longer being able to use memory > > mapped files for initrd, hurting performance, and, more importantly, the > > initrd address calculation is hard coded in qboot, and it always grows > > down rather than up, which means lots of brittle semantics would have to > > be changed around, incurring more complexity. In contrast, using cmdline > > is simple and doesn't interfere with anything. > > > > The microvm machine has a gross hack where it fiddles with fw_cfg data > > after the fact. So this hack is updated to account for this appending, > > by reserving some bytes. > > > > Cc: x86@kernel.org > > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > > Cc: H. Peter Anvin <hpa@zytor.com> > > Cc: Borislav Petkov <bp@alien8.de> > > Cc: Eric Biggers <ebiggers@kernel.org> > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > I apologize if this has already been reported/fixed already (I did a > brief search on lore.kernel.org) or if my terminology is not as precise > as it could be, I am a little out of my element here :) > > After this change as commit eac7a7791b ("x86: don't let decompressed > kernel image clobber setup_data") in QEMU master, I am no longer able to > boot a kernel directly through OVMF using '-append' + '-initrd' + > '-kernel'. Instead, systemd-boot tries to start the distribution's > kernel, which fails with: > > Error registering initrd: Already started > > This can be reproduced with just a defconfig Linux kernel (I used > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for > boot testing [1], and OVMF. Prior to this change, the kernel would start > up but after, I am dumped into the UEFI shell (as there is no > bootloader). > > The QEMU command I used was: > > $ qemu-system-x86_64 \ > -kernel arch/x86_64/boot/bzImage \ > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \ > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \ Oh no... Without jumping into it, at first glance, I have absolutely no idea. I suppose I could start debugging it and probably come up with a solution, but... @mst - I'm beginning to think that this whole setup_data route is cursed. This is accumulating hacks within hacks within hacks. What would you think if I just send a patch *removing* all use of setup_data (the rng seed and the dtb thing), and then we can gradually add that back with an actual overarching design. For example, it'd probably make sense to have a separate fwcfg file for setup_data rather than trying to mangle and existing one, etc. This way, we unbreak the tree, and let the new approach be reviewed more reasonably. Jason
On Wed, Feb 8, 2023 at 2:54 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Nathan (and MST), > > On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote: > > > > Hi Jason, > > > > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > > > The setup_data links are appended to the compressed kernel image. Since > > > the kernel image is typically loaded at 0x100000, setup_data lives at > > > `0x100000 + compressed_size`, which does not get relocated during the > > > kernel's boot process. > > > > > > The kernel typically decompresses the image starting at address > > > 0x1000000 (note: there's one more zero there than the compressed image > > > above). This usually is fine for most kernels. > > > > > > However, if the compressed image is actually quite large, then > > > setup_data will live at a `0x100000 + compressed_size` that extends into > > > the decompressed zone at 0x1000000. In other words, if compressed_size > > > is larger than `0x1000000 - 0x100000`, then the decompression step will > > > clobber setup_data, resulting in crashes. > > > > > > Visually, what happens now is that QEMU appends setup_data to the kernel > > > image: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > The problem is that this decompresses to 0x1000000 (one more zero). So > > > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > d e c o m p r e s s e d k e r n e l > > > |-------------------------------------------------------------| > > > 0x1000000 0x1000000+l3 > > > > > > The decompressed kernel seemingly overwriting the compressed kernel > > > image isn't a problem, because that gets relocated to a higher address > > > early on in the boot process, at the end of startup_64. setup_data, > > > however, stays in the same place, since those links are self referential > > > and nothing fixes them up. So the decompressed kernel clobbers it. > > > > > > Fix this by appending setup_data to the cmdline blob rather than the > > > kernel image blob, which remains at a lower address that won't get > > > clobbered. > > > > > > This could have been done by overwriting the initrd blob instead, but > > > that poses big difficulties, such as no longer being able to use memory > > > mapped files for initrd, hurting performance, and, more importantly, the > > > initrd address calculation is hard coded in qboot, and it always grows > > > down rather than up, which means lots of brittle semantics would have to > > > be changed around, incurring more complexity. In contrast, using cmdline > > > is simple and doesn't interfere with anything. > > > > > > The microvm machine has a gross hack where it fiddles with fw_cfg data > > > after the fact. So this hack is updated to account for this appending, > > > by reserving some bytes. > > > > > > Cc: x86@kernel.org > > > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > > > Cc: H. Peter Anvin <hpa@zytor.com> > > > Cc: Borislav Petkov <bp@alien8.de> > > > Cc: Eric Biggers <ebiggers@kernel.org> > > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > I apologize if this has already been reported/fixed already (I did a > > brief search on lore.kernel.org) or if my terminology is not as precise > > as it could be, I am a little out of my element here :) > > > > After this change as commit eac7a7791b ("x86: don't let decompressed > > kernel image clobber setup_data") in QEMU master, I am no longer able to > > boot a kernel directly through OVMF using '-append' + '-initrd' + > > '-kernel'. Instead, systemd-boot tries to start the distribution's > > kernel, which fails with: > > > > Error registering initrd: Already started > > > > This can be reproduced with just a defconfig Linux kernel (I used > > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for > > boot testing [1], and OVMF. Prior to this change, the kernel would start > > up but after, I am dumped into the UEFI shell (as there is no > > bootloader). > > > > The QEMU command I used was: > > > > $ qemu-system-x86_64 \ > > -kernel arch/x86_64/boot/bzImage \ > > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \ > > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on > > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \ > > Oh no... Without jumping into it, at first glance, I have absolutely > no idea. I suppose I could start debugging it and probably come up > with a solution, but... > > @mst - I'm beginning to think that this whole setup_data route is > cursed. This is accumulating hacks within hacks within hacks. What > would you think if I just send a patch *removing* all use of > setup_data (the rng seed and the dtb thing), and then we can gradually > add that back with an actual overarching design. For example, it'd > probably make sense to have a separate fwcfg file for setup_data > rather than trying to mangle and existing one, etc. This way, we > unbreak the tree, and let the new approach be reviewed more > reasonably. Sent as https://lore.kernel.org/qemu-devel/20230208180835.234638-1-Jason@zx2c4.com/
On Wed, Feb 08, 2023 at 02:54:41PM -0300, Jason A. Donenfeld wrote: > Hi Nathan (and MST), > > On Wed, Feb 8, 2023 at 2:45 PM Nathan Chancellor <nathan@kernel.org> wrote: > > > > Hi Jason, > > > > On Fri, Dec 30, 2022 at 11:07:25PM +0100, Jason A. Donenfeld wrote: > > > The setup_data links are appended to the compressed kernel image. Since > > > the kernel image is typically loaded at 0x100000, setup_data lives at > > > `0x100000 + compressed_size`, which does not get relocated during the > > > kernel's boot process. > > > > > > The kernel typically decompresses the image starting at address > > > 0x1000000 (note: there's one more zero there than the compressed image > > > above). This usually is fine for most kernels. > > > > > > However, if the compressed image is actually quite large, then > > > setup_data will live at a `0x100000 + compressed_size` that extends into > > > the decompressed zone at 0x1000000. In other words, if compressed_size > > > is larger than `0x1000000 - 0x100000`, then the decompression step will > > > clobber setup_data, resulting in crashes. > > > > > > Visually, what happens now is that QEMU appends setup_data to the kernel > > > image: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > The problem is that this decompresses to 0x1000000 (one more zero). So > > > if l1 is > (0x1000000-0x100000), then this winds up looking like: > > > > > > kernel image setup_data > > > |--------------------------||----------------| > > > 0x100000 0x100000+l1 0x100000+l1+l2 > > > > > > d e c o m p r e s s e d k e r n e l > > > |-------------------------------------------------------------| > > > 0x1000000 0x1000000+l3 > > > > > > The decompressed kernel seemingly overwriting the compressed kernel > > > image isn't a problem, because that gets relocated to a higher address > > > early on in the boot process, at the end of startup_64. setup_data, > > > however, stays in the same place, since those links are self referential > > > and nothing fixes them up. So the decompressed kernel clobbers it. > > > > > > Fix this by appending setup_data to the cmdline blob rather than the > > > kernel image blob, which remains at a lower address that won't get > > > clobbered. > > > > > > This could have been done by overwriting the initrd blob instead, but > > > that poses big difficulties, such as no longer being able to use memory > > > mapped files for initrd, hurting performance, and, more importantly, the > > > initrd address calculation is hard coded in qboot, and it always grows > > > down rather than up, which means lots of brittle semantics would have to > > > be changed around, incurring more complexity. In contrast, using cmdline > > > is simple and doesn't interfere with anything. > > > > > > The microvm machine has a gross hack where it fiddles with fw_cfg data > > > after the fact. So this hack is updated to account for this appending, > > > by reserving some bytes. > > > > > > Cc: x86@kernel.org > > > Cc: Philippe Mathieu-Daudé <philmd@linaro.org> > > > Cc: H. Peter Anvin <hpa@zytor.com> > > > Cc: Borislav Petkov <bp@alien8.de> > > > Cc: Eric Biggers <ebiggers@kernel.org> > > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > I apologize if this has already been reported/fixed already (I did a > > brief search on lore.kernel.org) or if my terminology is not as precise > > as it could be, I am a little out of my element here :) > > > > After this change as commit eac7a7791b ("x86: don't let decompressed > > kernel image clobber setup_data") in QEMU master, I am no longer able to > > boot a kernel directly through OVMF using '-append' + '-initrd' + > > '-kernel'. Instead, systemd-boot tries to start the distribution's > > kernel, which fails with: > > > > Error registering initrd: Already started > > > > This can be reproduced with just a defconfig Linux kernel (I used > > 6.2-rc7), the simple buildroot images that ClangBuiltLinux uses for > > boot testing [1], and OVMF. Prior to this change, the kernel would start > > up but after, I am dumped into the UEFI shell (as there is no > > bootloader). > > > > The QEMU command I used was: > > > > $ qemu-system-x86_64 \ > > -kernel arch/x86_64/boot/bzImage \ > > -append "console=ttyS0 earlycon=uart8250,io,0x3f8" \ > > -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on > > -drive if=pflash,format=raw,file=../boot-utils/images/x86_64/OVMF_VARS.fd \ > > Oh no... Without jumping into it, at first glance, I have absolutely > no idea. I suppose I could start debugging it and probably come up > with a solution, but... > > @mst - I'm beginning to think that this whole setup_data route is > cursed. This is accumulating hacks within hacks within hacks. What > would you think if I just send a patch *removing* all use of > setup_data (the rng seed and the dtb thing), and then we can gradually > add that back with an actual overarching design. For example, it'd > probably make sense to have a separate fwcfg file for setup_data > rather than trying to mangle and existing one, etc. This way, we > unbreak the tree, and let the new approach be reviewed more > reasonably. > > Jason Yea basically this is close to what I proposed. I can't off-hand figure out whether just dropping all of this is fine or we need some compat hacks to enable this in some way for 7.2, and it's pretty late here. Suggestions welcome.
diff --git a/hw/i386/microvm.c b/hw/i386/microvm.c index 170a331e3f..1b19d28c02 100644 --- a/hw/i386/microvm.c +++ b/hw/i386/microvm.c @@ -378,7 +378,8 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) MicrovmMachineState *mms = MICROVM_MACHINE(machine); BusState *bus; BusChild *kid; - char *cmdline; + char *cmdline, *existing_cmdline = fw_cfg_read_bytes_ptr(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA); + size_t len; /* * Find MMIO transports with attached devices, and add them to the kernel @@ -387,7 +388,7 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) * Yes, this is a hack, but one that heavily improves the UX without * introducing any significant issues. */ - cmdline = g_strdup(machine->kernel_cmdline); + cmdline = g_strdup(existing_cmdline); bus = sysbus_get_default(); QTAILQ_FOREACH(kid, &bus->children, sibling) { DeviceState *dev = kid->child; @@ -411,9 +412,11 @@ static void microvm_fix_kernel_cmdline(MachineState *machine) } } - fw_cfg_modify_i32(x86ms->fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(cmdline) + 1); - fw_cfg_modify_string(x86ms->fw_cfg, FW_CFG_CMDLINE_DATA, cmdline); - + len = strlen(cmdline); + if (len > VIRTIO_CMDLINE_TOTAL_MAX_LEN + strlen(existing_cmdline)) + fprintf(stderr, "qemu: virtio mmio cmdline too large, skipping\n"); + else + memcpy(existing_cmdline, cmdline, len + 1); g_free(cmdline); } diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 78cc131926..b57a993596 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -50,6 +50,7 @@ #include "hw/intc/i8259.h" #include "hw/rtc/mc146818rtc.h" #include "target/i386/sev.h" +#include "hw/i386/microvm.h" #include "hw/acpi/cpu_hotplug.h" #include "hw/irq.h" @@ -802,7 +803,7 @@ void x86_load_linux(X86MachineState *x86ms, bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled; uint16_t protocol; int setup_size, kernel_size, cmdline_size; - int dtb_size, setup_data_offset; + int dtb_size; uint32_t initrd_max; uint8_t header[8192], *setup, *kernel; hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0; @@ -813,12 +814,16 @@ void x86_load_linux(X86MachineState *x86ms, const char *kernel_filename = machine->kernel_filename; const char *initrd_filename = machine->initrd_filename; const char *dtb_filename = machine->dtb; - const char *kernel_cmdline = machine->kernel_cmdline; + char *kernel_cmdline; SevKernelLoaderContext sev_load_ctx = {}; enum { RNG_SEED_LENGTH = 32 }; - /* Align to 16 bytes as a paranoia measure */ - cmdline_size = (strlen(kernel_cmdline) + 16) & ~15; + /* Add the NUL terminator, some padding for the microvm cmdline fiddling + * hack, and then align to 16 bytes as a paranoia measure */ + cmdline_size = (strlen(machine->kernel_cmdline) + 1 + + VIRTIO_CMDLINE_TOTAL_MAX_LEN + 16) & ~15; + /* Make a copy, since we might append arbitrary bytes to it later. */ + kernel_cmdline = g_strndup(machine->kernel_cmdline, cmdline_size); /* load the kernel header */ f = fopen(kernel_filename, "rb"); @@ -959,12 +964,6 @@ void x86_load_linux(X86MachineState *x86ms, initrd_max = x86ms->below_4g_mem_size - acpi_data_size - 1; } - fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr); - fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, strlen(kernel_cmdline) + 1); - fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline); - sev_load_ctx.cmdline_data = (char *)kernel_cmdline; - sev_load_ctx.cmdline_size = strlen(kernel_cmdline) + 1; - if (protocol >= 0x202) { stl_p(header + 0x228, cmdline_addr); } else { @@ -1091,27 +1090,22 @@ void x86_load_linux(X86MachineState *x86ms, exit(1); } - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); - kernel_size = setup_data_offset + sizeof(SetupData) + dtb_size; - kernel = g_realloc(kernel, kernel_size); - - - setup_data = (SetupData *)(kernel + setup_data_offset); + cmdline_size += sizeof(SetupData) + dtb_size; + kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size); + setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + dtb_size); setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = prot_addr + setup_data_offset; + first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline); setup_data->type = cpu_to_le32(SETUP_DTB); setup_data->len = cpu_to_le32(dtb_size); - load_image_size(dtb_filename, setup_data->data, dtb_size); } - if (!legacy_no_rng_seed) { - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); - kernel_size = setup_data_offset + sizeof(SetupData) + RNG_SEED_LENGTH; - kernel = g_realloc(kernel, kernel_size); - setup_data = (SetupData *)(kernel + setup_data_offset); + if (!legacy_no_rng_seed && protocol >= 0x209) { + cmdline_size += sizeof(SetupData) + RNG_SEED_LENGTH; + kernel_cmdline = g_realloc(kernel_cmdline, cmdline_size); + setup_data = (void *)kernel_cmdline + cmdline_size - (sizeof(SetupData) + RNG_SEED_LENGTH); setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = prot_addr + setup_data_offset; + first_setup_data = cmdline_addr + ((void *)setup_data - (void *)kernel_cmdline); setup_data->type = cpu_to_le32(SETUP_RNG_SEED); setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH); @@ -1122,6 +1116,12 @@ void x86_load_linux(X86MachineState *x86ms, fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size); } + fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_ADDR, cmdline_addr); + fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE, cmdline_size); + fw_cfg_add_bytes(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline, cmdline_size); + sev_load_ctx.cmdline_data = (char *)kernel_cmdline; + sev_load_ctx.cmdline_size = cmdline_size; + fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr); fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size); sev_load_ctx.kernel_data = (char *)kernel; @@ -1134,7 +1134,7 @@ void x86_load_linux(X86MachineState *x86ms, * kernel on the other side of the fw_cfg interface matches the hash of the * file the user passed in. */ - if (!sev_enabled()) { + if (!sev_enabled() && first_setup_data) { SetupDataFixup *fixup = g_malloc(sizeof(*fixup)); memcpy(setup, header, MIN(sizeof(header), setup_size)); diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c index a00881bc64..432754eda4 100644 --- a/hw/nvram/fw_cfg.c +++ b/hw/nvram/fw_cfg.c @@ -741,6 +741,15 @@ void fw_cfg_add_bytes(FWCfgState *s, uint16_t key, void *data, size_t len) fw_cfg_add_bytes_callback(s, key, NULL, NULL, NULL, data, len, true); } +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key) +{ + int arch = !!(key & FW_CFG_ARCH_LOCAL); + + key &= FW_CFG_ENTRY_MASK; + assert(key < fw_cfg_max_entry(s)); + return s->entries[arch][key].data; +} + void fw_cfg_add_string(FWCfgState *s, uint16_t key, const char *value) { size_t sz = strlen(value) + 1; diff --git a/include/hw/i386/microvm.h b/include/hw/i386/microvm.h index fad97a891d..e8af61f194 100644 --- a/include/hw/i386/microvm.h +++ b/include/hw/i386/microvm.h @@ -50,8 +50,9 @@ */ /* Platform virtio definitions */ -#define VIRTIO_MMIO_BASE 0xfeb00000 -#define VIRTIO_CMDLINE_MAXLEN 64 +#define VIRTIO_MMIO_BASE 0xfeb00000 +#define VIRTIO_CMDLINE_MAXLEN 64 +#define VIRTIO_CMDLINE_TOTAL_MAX_LEN ((VIRTIO_CMDLINE_MAXLEN + 1) * 16) #define GED_MMIO_BASE 0xfea00000 #define GED_MMIO_BASE_MEMHP (GED_MMIO_BASE + 0x100) diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h index 2e503904dc..990dcdbb2e 100644 --- a/include/hw/nvram/fw_cfg.h +++ b/include/hw/nvram/fw_cfg.h @@ -139,6 +139,15 @@ void fw_cfg_add_bytes_callback(FWCfgState *s, uint16_t key, void *data, size_t len, bool read_only); +/** + * fw_cfg_read_bytes_ptr: + * @s: fw_cfg device being modified + * @key: selector key value for new fw_cfg item + * + * Reads an existing fw_cfg data pointer. + */ +void *fw_cfg_read_bytes_ptr(FWCfgState *s, uint16_t key); + /** * fw_cfg_add_string: * @s: fw_cfg device being modified
The setup_data links are appended to the compressed kernel image. Since the kernel image is typically loaded at 0x100000, setup_data lives at `0x100000 + compressed_size`, which does not get relocated during the kernel's boot process. The kernel typically decompresses the image starting at address 0x1000000 (note: there's one more zero there than the compressed image above). This usually is fine for most kernels. However, if the compressed image is actually quite large, then setup_data will live at a `0x100000 + compressed_size` that extends into the decompressed zone at 0x1000000. In other words, if compressed_size is larger than `0x1000000 - 0x100000`, then the decompression step will clobber setup_data, resulting in crashes. Visually, what happens now is that QEMU appends setup_data to the kernel image: kernel image setup_data |--------------------------||----------------| 0x100000 0x100000+l1 0x100000+l1+l2 The problem is that this decompresses to 0x1000000 (one more zero). So if l1 is > (0x1000000-0x100000), then this winds up looking like: kernel image setup_data |--------------------------||----------------| 0x100000 0x100000+l1 0x100000+l1+l2 d e c o m p r e s s e d k e r n e l |-------------------------------------------------------------| 0x1000000 0x1000000+l3 The decompressed kernel seemingly overwriting the compressed kernel image isn't a problem, because that gets relocated to a higher address early on in the boot process, at the end of startup_64. setup_data, however, stays in the same place, since those links are self referential and nothing fixes them up. So the decompressed kernel clobbers it. Fix this by appending setup_data to the cmdline blob rather than the kernel image blob, which remains at a lower address that won't get clobbered. This could have been done by overwriting the initrd blob instead, but that poses big difficulties, such as no longer being able to use memory mapped files for initrd, hurting performance, and, more importantly, the initrd address calculation is hard coded in qboot, and it always grows down rather than up, which means lots of brittle semantics would have to be changed around, incurring more complexity. In contrast, using cmdline is simple and doesn't interfere with anything. The microvm machine has a gross hack where it fiddles with fw_cfg data after the fact. So this hack is updated to account for this appending, by reserving some bytes. Cc: x86@kernel.org Cc: Philippe Mathieu-Daudé <philmd@linaro.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- Changes v2->v3: - Fix mistakes in string handling. Changes v1->v2: - Append setup_data to cmdline instead of kernel image. hw/i386/microvm.c | 13 ++++++---- hw/i386/x86.c | 50 +++++++++++++++++++-------------------- hw/nvram/fw_cfg.c | 9 +++++++ include/hw/i386/microvm.h | 5 ++-- include/hw/nvram/fw_cfg.h | 9 +++++++ 5 files changed, 54 insertions(+), 32 deletions(-)