Message ID | 20161025144959.GZ30231@citrix.com (mailing list archive) |
---|---|
State | New, archived |
On Tue, Oct 25, 2016 at 03:49:59PM +0100, Wei Liu wrote:
> On Tue, Oct 25, 2016 at 01:37:45PM +0200, Sander Eikelenboom wrote:
> >
> > Tuesday, October 25, 2016, 1:24:12 PM, you wrote:
> >
> > > On Tue, Oct 18, 2016 at 01:48:23PM +0100, Wei Liu wrote:
> > >> On Mon, Oct 17, 2016 at 05:28:17PM +0200, Sander Eikelenboom wrote:
> > >> > Thursday, October 13, 2016, 4:43:31 PM, you wrote:
> > >> >
> > >> > > Hi Jan / Wei,
> > >> >
> > >> > > Took a while before i had the chance to fiddle some more to find the actual culprit.
> > >> > > After analyzing the output of xl -vvvvv create somewhat more i came to the
> > >> > > insight it was probably Qemu and not Xen causing the fault.
> > >> >
> > >> > > As a test I just used a qemu-xen binary built with xen-4.6.0 booting up a guest with
> > >> > > direct kernel boot mode on xen-unstable. And that old qemu binary works fine.
> > >> >
> > >> > > After testing i can conclude, Jan was right, the bisection was a red herring,
> > >> > > the problem is caused by some change in Qemu and not by something in the Xen tree.
> > >> > > (strange thing is that for as far as i know i did a "make distclean" between
> > >> > > every build (taking a lot of time), which should have pulled a fresh qemu-xen
> > >> > > tree and therefore the bisection should have led to a commit with a Config.mk
> > >> > > hash change for qemu-xen version.)
> > >> >
> > >> > > Will see if i can find some more time and bisect qemu and find the culprit.
> > >> >
> > >> > > --
> > >> > > Sander
> > >> >
> > >> >
> > >> > Unfortunately i have to give up on this issue, for me it's impossible to bisect this
> > >> > issue with my present git-foo.
> > >> >
> > >> > The first try with bisection of the whole xen-tree seems to have hit the issue that the
> > >> > qemu-revision that gets pulled on a fresh build is "master" during the whole
> > >> > dev period. That creates havoc when trying to bisect, since you are testing
> > >> > combinations that were never developed (nor auto tested) in that combination
> > >> > (especially when a xen-tree and qemu-tree change have a dependency like Roger's
> > >> > "xen: fix usage of xc_domain_create in domain builder")
> > >> >
> > >> > While trying to bisect only qemu (keeping xen itself on RELEASE-4.6.0 and
> > >> > seabios on rel-1.8.2) it gets stuck on issues with that tree.
> > >> > Between 4.6.0 and 4.7.0 the qemu tree switched from git://xenbits.xen.org/qemu-upstream-4.6-testing.git
> > >> > to git://xenbits.xen.org/qemu-xen.git), after that there seem to have
> > >> > been a lot of merges going back and forth and to me it seems a mess (but as i
> > >> > said it could also be a lack of git-foo). I tried by manual bisecting, removing
> > >> > and cloning trees again etc. but that doesn't suffice, it's all going nowhere.
> > >> > (while the known good build (plain RELEASE-4.6.0) always works, so it doesn't
> > >> > seem to be some random problem)
> > >> >
> > >>
> > >> Thanks for trying.
> > >>
> > >> > So perhaps some dev can at least verify that the issue is there (since 4.7.0)
> > >> > and put it on the "known broken" list of things.
> > >> >
> > >>
> > >> I will put this into the list of things I need to look at.
> > >>
> >
> > > I investigated this a bit. The root cause is the memory accounting is
> > > wrong in QEMU. It would try to allocate more ram than allowed. I haven't
> > > tried to figure out exactly what is wrong, though.
> >
> > That confirms what i was thinking in the end, but bisecting the qemu-tree
> > changes between the xen-4.6.0 and xen-4.7.0 release proved to be pretty
> > difficult as i explained. So if you have a hunch as to in what code it should
> > reside debugging instead of bisecting would probably be better.
> > (so one of the questions is what changes in the memory accounting when you
> > supply the kernel from the host instead of the guest, since booting a kernel
> > with grub from within the guest doesn't give any memory accounting issues.)
> >
> > Thanks for investigating !
>
> I think I hunted down the offending function.
>
> Mind trying this patch for me?
>
> ---8<---
> From 3c7f8b55109959cf470deeee452f452f7c0ade51 Mon Sep 17 00:00:00 2001
> From: Wei Liu <wei.liu2@citrix.com>
> Date: Tue, 25 Oct 2016 15:45:04 +0100
> Subject: [PATCH] acpi: don't build acpi tables for xen guests
>
> Xen's toolstack is in charge of building ACPI tables. Skip acpi table
> building if running on Xen.
>
> This issue is discovered due to direct kernel boot on Xen doesn't boot
> anymore, because the new ACPI tables cause the guest to exceed its
> memory allocation limit.
>
> Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> ---
> Cc: Anthony PERARD <anthony.perard@citrix.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
>
> RFC because I'm not sure this is the best way to fix it.
> ---
>  hw/i386/acpi-build.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index a26a4bb..6ba5031 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -45,6 +45,7 @@
>  #include "sysemu/tpm_backend.h"
>  #include "hw/timer/mc146818rtc_regs.h"
>  #include "sysemu/numa.h"
> +#include "hw/xen/xen.h"
>
>  /* Supported chipsets: */
>  #include "hw/acpi/piix4.h"
> @@ -2865,6 +2866,12 @@ void acpi_setup(void)
>          return;
>      }
>
> +    if (xen_enabled()) {
> +        fprintf(stderr, "%s %d\n", __FILE__, __LINE__);

Oops, this is just debug output - but you get the idea.

> +        ACPI_BUILD_DPRINTF("Xen enabled. Bailing out.\n");
> +        return;
> +    }
> +
>      build_state = g_malloc0(sizeof *build_state);
>
>      acpi_set_pci_info();
> --
> 2.1.4
>
On 2016-10-25 16:49, Wei Liu wrote:
> On Tue, Oct 25, 2016 at 01:37:45PM +0200, Sander Eikelenboom wrote:
>>
>> Tuesday, October 25, 2016, 1:24:12 PM, you wrote:
>>
>> > On Tue, Oct 18, 2016 at 01:48:23PM +0100, Wei Liu wrote:
>> >> On Mon, Oct 17, 2016 at 05:28:17PM +0200, Sander Eikelenboom wrote:
>> >> > Thursday, October 13, 2016, 4:43:31 PM, you wrote:
>> >> >
>> >> > > Hi Jan / Wei,
>> >> >
>> >> > > Took a while before i had the chance to fiddle some more to find the actual culprit.
>> >> > > After analyzing the output of xl -vvvvv create somewhat more i came to the
>> >> > > insight it was probably Qemu and not Xen causing the fault.
>> >> >
>> >> > > As a test I just used a qemu-xen binary built with xen-4.6.0 booting up a guest with
>> >> > > direct kernel boot mode on xen-unstable. And that old qemu binary works fine.
>> >> >
>> >> > > After testing i can conclude, Jan was right, the bisection was a red herring,
>> >> > > the problem is caused by some change in Qemu and not by something in the Xen tree.
>> >> > > (strange thing is that for as far as i know i did a "make distclean" between
>> >> > > every build (taking a lot of time), which should have pulled a fresh qemu-xen
>> >> > > tree and therefore the bisection should have led to a commit with a Config.mk
>> >> > > hash change for qemu-xen version.)
>> >> >
>> >> > > Will see if i can find some more time and bisect qemu and find the culprit.
>> >> >
>> >> > > --
>> >> > > Sander
>> >> >
>> >> >
>> >> > Unfortunately i have to give up on this issue, for me it's impossible to bisect this
>> >> > issue with my present git-foo.
>> >> >
>> >> > The first try with bisection of the whole xen-tree seems to have hit the issue that the
>> >> > qemu-revision that gets pulled on a fresh build is "master" during the whole
>> >> > dev period. That creates havoc when trying to bisect, since you are testing
>> >> > combinations that were never developed (nor auto tested) in that combination
>> >> > (especially when a xen-tree and qemu-tree change have a dependency like Roger's
>> >> > "xen: fix usage of xc_domain_create in domain builder")
>> >> >
>> >> > While trying to bisect only qemu (keeping xen itself on RELEASE-4.6.0 and
>> >> > seabios on rel-1.8.2) it gets stuck on issues with that tree.
>> >> > Between 4.6.0 and 4.7.0 the qemu tree switched from git://xenbits.xen.org/qemu-upstream-4.6-testing.git
>> >> > to git://xenbits.xen.org/qemu-xen.git), after that there seem to have
>> >> > been a lot of merges going back and forth and to me it seems a mess (but as i
>> >> > said it could also be a lack of git-foo). I tried by manual bisecting, removing
>> >> > and cloning trees again etc. but that doesn't suffice, it's all going nowhere.
>> >> > (while the known good build (plain RELEASE-4.6.0) always works, so it doesn't
>> >> > seem to be some random problem)
>> >> >
>> >>
>> >> Thanks for trying.
>> >>
>> >> > So perhaps some dev can at least verify that the issue is there (since 4.7.0)
>> >> > and put it on the "known broken" list of things.
>> >> >
>> >>
>> >> I will put this into the list of things I need to look at.
>> >>
>>
>> > I investigated this a bit. The root cause is the memory accounting is
>> > wrong in QEMU. It would try to allocate more ram than allowed. I haven't
>> > tried to figure out exactly what is wrong, though.
>>
>> That confirms what i was thinking in the end, but bisecting the qemu-tree
>> changes between the xen-4.6.0 and xen-4.7.0 release proved to be pretty
>> difficult as i explained. So if you have a hunch as to in what code it should
>> reside debugging instead of bisecting would probably be better.
>> (so one of the questions is what changes in the memory accounting when you
>> supply the kernel from the host instead of the guest, since booting a kernel
>> with grub from within the guest doesn't give any memory accounting issues.)
>>
>> Thanks for investigating !
>
> I think I hunted down the offending function.
>
> Mind trying this patch for me?

Hi Wei,

This seems to help :)

With a linux 4.8 kernel the HVM guest now boots fine with direct kernel boot !

But there seems to be a gotcha which i think is not in the Xen docs/wiki:
when trying a linux 4.3 kernel the guest still didn't boot and i got a:
"qemu: linux kernel too old to load a ram disk" in the qemu log.
I don't know what qemu regards as "old" in this case.

Another consideration: would it be worthwhile to add an OSStest for direct
kernel boot ?
(under the assumption that the host kernel that gets built can also boot on
HVM guest it's probably a very cheap test not requiring any additional
builds.)

Thanks again !

--
Sander

> ---8<---
> From 3c7f8b55109959cf470deeee452f452f7c0ade51 Mon Sep 17 00:00:00 2001
> From: Wei Liu <wei.liu2@citrix.com>
> Date: Tue, 25 Oct 2016 15:45:04 +0100
> Subject: [PATCH] acpi: don't build acpi tables for xen guests
>
> Xen's toolstack is in charge of building ACPI tables. Skip acpi table
> building if running on Xen.
>
> This issue is discovered due to direct kernel boot on Xen doesn't boot
> anymore, because the new ACPI tables cause the guest to exceed its
> memory allocation limit.
>
> Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> ---
> Cc: Anthony PERARD <anthony.perard@citrix.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
>
> RFC because I'm not sure this is the best way to fix it.
> ---
>  hw/i386/acpi-build.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
> index a26a4bb..6ba5031 100644
> --- a/hw/i386/acpi-build.c
> +++ b/hw/i386/acpi-build.c
> @@ -45,6 +45,7 @@
>  #include "sysemu/tpm_backend.h"
>  #include "hw/timer/mc146818rtc_regs.h"
>  #include "sysemu/numa.h"
> +#include "hw/xen/xen.h"
>
>  /* Supported chipsets: */
>  #include "hw/acpi/piix4.h"
> @@ -2865,6 +2866,12 @@ void acpi_setup(void)
>          return;
>      }
>
> +    if (xen_enabled()) {
> +        fprintf(stderr, "%s %d\n", __FILE__, __LINE__);
> +        ACPI_BUILD_DPRINTF("Xen enabled. Bailing out.\n");
> +        return;
> +    }
> +
>      build_state = g_malloc0(sizeof *build_state);
>
>      acpi_set_pci_info();
On Tue, Oct 25, 2016 at 07:25:06PM +0200, Sander Eikelenboom wrote:
> On 2016-10-25 16:49, Wei Liu wrote:
> >On Tue, Oct 25, 2016 at 01:37:45PM +0200, Sander Eikelenboom wrote:
> >>
> >>Tuesday, October 25, 2016, 1:24:12 PM, you wrote:
> >>
> >>> On Tue, Oct 18, 2016 at 01:48:23PM +0100, Wei Liu wrote:
> >>>> On Mon, Oct 17, 2016 at 05:28:17PM +0200, Sander Eikelenboom wrote:
> >>>> > Thursday, October 13, 2016, 4:43:31 PM, you wrote:
> >>>> >
> >>>> > > Hi Jan / Wei,
> >>>> >
> >>>> > > Took a while before i had the chance to fiddle some more to find the actual culprit.
> >>>> > > After analyzing the output of xl -vvvvv create somewhat more i came to the
> >>>> > > insight it was probably Qemu and not Xen causing the fault.
> >>>> >
> >>>> > > As a test I just used a qemu-xen binary built with xen-4.6.0 booting up a guest with
> >>>> > > direct kernel boot mode on xen-unstable. And that old qemu binary works fine.
> >>>> >
> >>>> > > After testing i can conclude, Jan was right, the bisection was a red herring,
> >>>> > > the problem is caused by some change in Qemu and not by something in the Xen tree.
> >>>> > > (strange thing is that for as far as i know i did a "make distclean" between
> >>>> > > every build (taking a lot of time), which should have pulled a fresh qemu-xen
> >>>> > > tree and therefore the bisection should have led to a commit with a Config.mk
> >>>> > > hash change for qemu-xen version.)
> >>>> >
> >>>> > > Will see if i can find some more time and bisect qemu and find the culprit.
> >>>> >
> >>>> > > --
> >>>> > > Sander
> >>>> >
> >>>> >
> >>>> > Unfortunately i have to give up on this issue, for me it's impossible to bisect this
> >>>> > issue with my present git-foo.
> >>>> >
> >>>> > The first try with bisection of the whole xen-tree seems to have hit the issue that the
> >>>> > qemu-revision that gets pulled on a fresh build is "master" during the whole
> >>>> > dev period. That creates havoc when trying to bisect, since you are testing
> >>>> > combinations that were never developed (nor auto tested) in that combination
> >>>> > (especially when a xen-tree and qemu-tree change have a dependency like Roger's
> >>>> > "xen: fix usage of xc_domain_create in domain builder")
> >>>> >
> >>>> > While trying to bisect only qemu (keeping xen itself on RELEASE-4.6.0 and
> >>>> > seabios on rel-1.8.2) it gets stuck on issues with that tree.
> >>>> > Between 4.6.0 and 4.7.0 the qemu tree switched from git://xenbits.xen.org/qemu-upstream-4.6-testing.git
> >>>> > to git://xenbits.xen.org/qemu-xen.git), after that there seem to have
> >>>> > been a lot of merges going back and forth and to me it seems a mess (but as i
> >>>> > said it could also be a lack of git-foo). I tried by manual bisecting, removing
> >>>> > and cloning trees again etc. but that doesn't suffice, it's all going nowhere.
> >>>> > (while the known good build (plain RELEASE-4.6.0) always works, so it doesn't
> >>>> > seem to be some random problem)
> >>>> >
> >>>>
> >>>> Thanks for trying.
> >>>>
> >>>> > So perhaps some dev can at least verify that the issue is there (since 4.7.0)
> >>>> > and put it on the "known broken" list of things.
> >>>> >
> >>>>
> >>>> I will put this into the list of things I need to look at.
> >>>>
> >>
> >>> I investigated this a bit. The root cause is the memory accounting is
> >>> wrong in QEMU. It would try to allocate more ram than allowed. I haven't
> >>> tried to figure out exactly what is wrong, though.
> >>
> >>That confirms what i was thinking in the end, but bisecting the qemu-tree
> >>changes between the xen-4.6.0 and xen-4.7.0 release proved to be pretty
> >>difficult as i explained. So if you have a hunch as to in what code it should
> >>reside debugging instead of bisecting would probably be better.
> >>(so one of the questions is what changes in the memory accounting when you
> >>supply the kernel from the host instead of the guest, since booting a kernel
> >>with grub from within the guest doesn't give any memory accounting issues.)
> >>
> >>Thanks for investigating !
> >
> >I think I hunted down the offending function.
> >
> >Mind trying this patch for me?
>
> Hi Wei,
>
> This seems to help :)
>
> With a linux 4.8 kernel the HVM guest now boots fine with direct kernel boot !
>
> But there seems to be a gotcha which i think is not in the Xen docs/wiki:
> when trying a linux 4.3 kernel the guest still didn't boot and i got a:
> "qemu: linux kernel too old to load a ram disk" in the qemu log.
> I don't know what qemu regards as "old" in this case.
>

QEMU checks for a signature / version in kernel header or whatnot. I
can't tell why that specific number is chosen, though.

> Another consideration: would it be worthwhile to add an OSStest for direct
> kernel boot ?
> (under the assumption that the host kernel that gets built can also boot on
> HVM guest it's probably a very cheap test not requiring any additional
> builds.)

Yes, definitely. The more tests, the merrier.

Wei.
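Wei's remark that QEMU checks "a signature / version in kernel header" lines up with the Linux/x86 boot protocol, in which a bzImage carries the 4-byte signature "HdrS" at offset 0x202 and a 16-bit little-endian protocol version at offset 0x206, with initrd support introduced in protocol 2.00. The following is a minimal C sketch of such a check, not QEMU's actual code: the helper names boot_protocol_version and can_load_ramdisk are hypothetical, and the idea that the "too old to load a ram disk" message corresponds to a version below 0x0200 is an assumption that the thread does not confirm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets of two fields in the Linux/x86 real-mode kernel header. */
#define HDR_MAGIC_OFFSET   0x202  /* 4-byte signature "HdrS" */
#define HDR_VERSION_OFFSET 0x206  /* 2-byte little-endian protocol version */

/*
 * Hypothetical helper: return the boot protocol version stored in the
 * image, or 0 when the buffer is too short or the signature is missing.
 */
static uint16_t boot_protocol_version(const uint8_t *image, size_t len)
{
    if (len < HDR_VERSION_OFFSET + 2) {
        return 0;
    }
    if (memcmp(image + HDR_MAGIC_OFFSET, "HdrS", 4) != 0) {
        return 0;
    }
    return (uint16_t)(image[HDR_VERSION_OFFSET] |
                      (image[HDR_VERSION_OFFSET + 1] << 8));
}

/*
 * Hypothetical policy (assumption): a ram disk is only accepted when the
 * kernel advertises boot protocol 2.00 or newer.
 */
static int can_load_ramdisk(const uint8_t *image, size_t len)
{
    return boot_protocol_version(image, len) >= 0x0200;
}
```

In this sketch an image without the "HdrS" signature yields version 0, so it fails the same test as a genuinely old kernel. Whether that is what happened with the 4.3 kernel is not established in the thread.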
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index a26a4bb..6ba5031 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -45,6 +45,7 @@
 #include "sysemu/tpm_backend.h"
 #include "hw/timer/mc146818rtc_regs.h"
 #include "sysemu/numa.h"
+#include "hw/xen/xen.h"
 
 /* Supported chipsets: */
 #include "hw/acpi/piix4.h"
@@ -2865,6 +2866,12 @@ void acpi_setup(void)
         return;
     }
 
+    if (xen_enabled()) {
+        fprintf(stderr, "%s %d\n", __FILE__, __LINE__);
+        ACPI_BUILD_DPRINTF("Xen enabled. Bailing out.\n");
+        return;
+    }
+
     build_state = g_malloc0(sizeof *build_state);
 
     acpi_set_pci_info();
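Since Wei flagged the fprintf line as leftover debug output, the intended change boils down to an early return. The following stand-alone C sketch shows only that shape, outside QEMU. Everything in it is a stand-in: fake_xen_enabled, xen_enabled_stub, tables_built and acpi_setup_sketch are hypothetical names, and the real acpi_setup() does far more than bump a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for QEMU state, used only for illustration. */
static bool fake_xen_enabled;
static int tables_built;

/* Plays the role of xen_enabled() in the patch. */
static bool xen_enabled_stub(void)
{
    return fake_xen_enabled;
}

/* Mirrors the control flow of the patched acpi_setup(), minus the debug print. */
static void acpi_setup_sketch(void)
{
    if (xen_enabled_stub()) {
        /* Xen's toolstack builds the ACPI tables, so skip building them here. */
        return;
    }

    /* Placeholder for the table-building work done in the non-Xen case. */
    tables_built++;
}
```

The guard sits before any allocation, which matches the patch placing it ahead of the g_malloc0() call, so nothing needs to be freed on the Xen path.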