diff mbox

[OSSTEST,11/16] ts-debian-fixup: use correct resume device

Message ID 20171020103840.32762-12-wei.liu2@citrix.com (mailing list archive)
State New, archived
Headers show

Commit Message

Wei Liu Oct. 20, 2017, 10:38 a.m. UTC
See code comment for explanation.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
---
 ts-debian-fixup | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Ian Jackson Oct. 20, 2017, 11:05 a.m. UTC | #1
Wei Liu writes ("[OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> See code comment for explanation.
...
> +    # There might be stale entries in /etc/initramfs-tools/conf.d/resume
> +    # which get stored in the initramfs. That introduces delay in guest booting
> +    # which might cause tests to fail.

Why might there be such stale entries ?

> +    # This is particularly prominent in systemd enabled releases when it tries
> +    # to scan for the inexistent device(s) for a long time.

Why is this not a bug in systemd ?  If there are workarounds in
osstest for Debian, I usually want a bug in the BTS, and a suite limit
on the workaround so that we poke Debian again in 2 years time...

> +    # Override that in kernel command line with the correct swap partition.
> +    $cfg =~ m/'phy:.+-swap,(xvda\d+),.*'/;
> +    $extra .= " resume=/dev/$1";
> +    logm("change resume device to $1");

Alternatively, if indeed this is necessary and not due to bugs,
perhaps it should be done by xen-tools ?

Ian.
Wei Liu Oct. 20, 2017, 11:14 a.m. UTC | #2
On Fri, Oct 20, 2017 at 12:05:55PM +0100, Ian Jackson wrote:
> Wei Liu writes ("[OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > See code comment for explanation.
> ...
> > +    # There might be stale entries in /etc/initramfs-tools/conf.d/resume
> > +    # which get stored in the initramfs. That introduces delay in guest booting
> > +    # which might cause tests to fail.
> 
> Why might there be such stale entries ?

The ramdisk is taken from the host, which contains that file. The resume
device is going to point to the swap partition in the host.

> 
> > +    # This is particularly prominent in systemd enabled releases when it tries
> > +    # to scan for the inexistent device(s) for a long time.
> 
> Why is this not a bug in systemd ?  If there are workarounds in
> osstest for Debian, I usually want a bug in the BTS, and a suite limit
> on the workaround so that we poke Debian again in 2 years time...
> 

It has always been like that since forever. It appears that stretch will
try a lot harder than jessie (which only tries once). Now thinking about
it I can't say for sure whether that's due to the new kernel, or systemd
or both.

> > +    # Override that in kernel command line with the correct swap partition.
> > +    $cfg =~ m/'phy:.+-swap,(xvda\d+),.*'/;
> > +    $extra .= " resume=/dev/$1";
> > +    logm("change resume device to $1");
> 
> Alternatively, if indeed this is necessary and not due to bugs,
> perhaps it should be done by xen-tools ?
> 

No, see the first paragraph. xen-tools can't be expected to modify the
ramdisk.
Ian Jackson Oct. 20, 2017, 12:35 p.m. UTC | #3
Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> On Fri, Oct 20, 2017 at 12:05:55PM +0100, Ian Jackson wrote:
> > Wei Liu writes ("[OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > > See code comment for explanation.
> > ...
> > > +    # There might be stale entries in /etc/initramfs-tools/conf.d/resume
> > > +    # which get stored in the initramfs. That introduces delay in guest booting
> > > +    # which might cause tests to fail.
> > 
> > Why might there be such stale entries ?
> 
> The ramdisk is taken from the host, which contains that file. The resume
> device is going to point to the swap partition in the host.

Ah.  Hrm.  I guess this was always a bit of a bodge.

> > > +    # Override that in kernel command line with the correct swap partition.
> > > +    $cfg =~ m/'phy:.+-swap,(xvda\d+),.*'/;
> > > +    $extra .= " resume=/dev/$1";
> > > +    logm("change resume device to $1");
> > 
> > Alternatively, if indeed this is necessary and not due to bugs,
> > perhaps it should be done by xen-tools ?
> 
> No, see the first paragraph. xen-tools can't be expected to modify the
> ramdisk.

Maybe it should set the appropriate "extra" though.  IIRC xen-tools
does implement this sort-of-trick of reusing the hosts initrd.  So why
does this bug not happen with other users of xen-tools ?

Ian.
Wei Liu Oct. 20, 2017, 1:17 p.m. UTC | #4
On Fri, Oct 20, 2017 at 01:35:41PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > On Fri, Oct 20, 2017 at 12:05:55PM +0100, Ian Jackson wrote:
> > > Wei Liu writes ("[OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > > > See code comment for explanation.
> > > ...
> > > > +    # There might be stale entries in /etc/initramfs-tools/conf.d/resume
> > > > +    # which get stored in the initramfs. That introduces delay in guest booting
> > > > +    # which might cause tests to fail.
> > > 
> > > Why might there be such stale entries ?
> > 
> > The ramdisk is taken from the host, which contains that file. The resume
> > device is going to point to the swap partition in the host.
> 
> Ah.  Hrm.  I guess this was always a bit of a bodge.
> 
> > > > +    # Override that in kernel command line with the correct swap partition.
> > > > +    $cfg =~ m/'phy:.+-swap,(xvda\d+),.*'/;
> > > > +    $extra .= " resume=/dev/$1";
> > > > +    logm("change resume device to $1");
> > > 
> > > Alternatively, if indeed this is necessary and not due to bugs,
> > > perhaps it should be done by xen-tools ?
> > 
> > No, see the first paragraph. xen-tools can't be expected to modify the
> > ramdisk.
> 
> Maybe it should set the appropriate "extra" though.  IIRC xen-tools
> does implement this sort-of-trick of reusing the hosts initrd.  So why
> does this bug not happen with other users of xen-tools ?
> 

I wouldn't call this behaviour a bug.  Most people won't notice it
because the guest will boot eventually. It is not like this will cause
the guest to crash.

Osstest sets a timeout. We can bump the timeout but that's worse than
the solution here.

There is no option in xen-create-image to manipulate extra=. The only
viable solution is to provide a new template. That seems overkill and
would require more code.
Ian Jackson Oct. 20, 2017, 1:20 p.m. UTC | #5
Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> I wouldn't call this behaviour a bug.  Most people won't notice it
> because the guest will boot eventually. It is not like this will cause
> the guest to crash.

I don't think the guest booting "eventually" after some timeout is
correct.

> There is no option in xen-create-image to manipulate extra=. The only
> viable solution is to provide a new template. That seems overkill and
> would require more code.

What you're saying is that this infelicity in xen-tools is not
particularly easy to fix.  That doesn't mean it shouldn't be reported
as a bug.

Thanks,
Ian.
Wei Liu Oct. 20, 2017, 1:36 p.m. UTC | #6
On Fri, Oct 20, 2017 at 02:20:29PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > I wouldn't call this behaviour a bug.  Most people won't notice it
> > because the guest will boot eventually. It is not like this will cause
> > the guest to crash.
> 
> I don't think the guest booting "eventually" after some timeout is
> correct.
>

The OS is honoring whatever is written in the resume file. That's the
correct behaviour to me.

Booting after trying is better than not booting at all.  That parameter
isn't vital to a functional system. This has always been the case since
forever. It is only the "timeout" in the past is shorter so osstest
never noticed it.

What is not correct is the content of the file. We can't blame the guest
OS for that.

> > There is no option in xen-create-image to manipulate extra=. The only
> > viable solution is to provide a new template. That seems overkill and
> > would require more code.
> 
> What you're saying is that this infelicity in xen-tools is not
> particularly easy to fix.  That doesn't mean it shouldn't be reported
> as a bug.

There is nothing to fix. System administrator should provide a proper
template -- that's what xen-tools developers most likely to say.

It would require a lot more code in osstest to fix the template, and the
essence of such patch would still be the same.
Ian Jackson Oct. 20, 2017, 1:42 p.m. UTC | #7
Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> On Fri, Oct 20, 2017 at 02:20:29PM +0100, Ian Jackson wrote:
> > I don't think the guest booting "eventually" after some timeout is
> > correct.
> 
> The OS is honoring whatever is written in the resume file. That's the
> correct behaviour to me.
> 
> Booting after trying is better than not booting at all.  That parameter
> isn't vital to a functional system. This has always been the case since
> forever. It is only the "timeout" in the past is shorter so osstest
> never noticed it.
> 
> What is not correct is the content of the file. We can't blame the guest
> OS for that.

I agree with all of this.  But, the overall behaviour is not correct.
Even though all of the things that osstest specified to
xen-create-image were correct.  Therefore there is a bug.

> > > There is no option in xen-create-image to manipulate extra=. The only
> > > viable solution is to provide a new template. That seems overkill and
> > > would require more code.
> > 
> > What you're saying is that this infelicity in xen-tools is not
> > particularly easy to fix.  That doesn't mean it shouldn't be reported
> > as a bug.
> 
> There is nothing to fix. System administrator should provide a proper
> template -- that's what xen-tools developers most likely to say.

I'm not sure what you mean here.  We are using the config template
provided by xen-tools.  The reuse of the host's initrd is suggested by
the xen-create-image documentation.

Ian.
Wei Liu Oct. 20, 2017, 2:56 p.m. UTC | #8
On Fri, Oct 20, 2017 at 02:42:53PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > On Fri, Oct 20, 2017 at 02:20:29PM +0100, Ian Jackson wrote:
> > > I don't think the guest booting "eventually" after some timeout is
> > > correct.
> > 
> > The OS is honoring whatever is written in the resume file. That's the
> > correct behaviour to me.
> > 
> > Booting after trying is better than not booting at all.  That parameter
> > isn't vital to a functional system. This has always been the case since
> > forever. It is only the "timeout" in the past is shorter so osstest
> > never noticed it.
> > 
> > What is not correct is the content of the file. We can't blame the guest
> > OS for that.
> 
> I agree with all of this.  But, the overall behaviour is not correct.
> Even though all of the things that osstest specified to
> xen-create-image were correct.  Therefore there is a bug.
> 
> > > > There is no option in xen-create-image to manipulate extra=. The only
> > > > viable solution is to provide a new template. That seems overkill and
> > > > would require more code.
> > > 
> > > What you're saying is that this infelicity in xen-tools is not
> > > particularly easy to fix.  That doesn't mean it shouldn't be reported
> > > as a bug.
> > 
> > There is nothing to fix. System administrator should provide a proper
> > template -- that's what xen-tools developers most likely to say.
> 
> I'm not sure what you mean here.  We are using the config template
> provided by xen-tools.  The reuse of the host's initrd is suggested by
> the xen-create-image documentation.
> 

TBH I don't know what to write in a report. What is the expected
behaviour?
Ian Jackson Oct. 20, 2017, 3:10 p.m. UTC | #9
Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> On Fri, Oct 20, 2017 at 02:42:53PM +0100, Ian Jackson wrote:
> > Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > I'm not sure what you mean here.  We are using the config template
> > provided by xen-tools.  The reuse of the host's initrd is suggested by
> > the xen-create-image documentation.
> 
> TBH I don't know what to write in a report. What is the expected
> behaviour?

How about this ?  Adjust to taste, correcting any
lies/misunderstandings, maybe adding references, etc.


Steps to reproduce:

  Follow instructions in xen-create-image, including suggestion to
  re-use host's initrd, to make a guest running stretch.  Try to boot
  it.

Expected behaviour:

  Guest boots promptly.

Observed behaviour:

  Guest pauses for a long time during boot waiting for the "resume
  device" to become available.

Analysis:

  This is because the initrd has the host's swap device configured.

  With stretch, and systemd, the delay is longer; earler Debian
  releases had a similar issue but the shorter timeout meant it was a
  less serious problem.

Suggested remedy:

  It is unfortunate that the initrd has the resume device baked into
  it.  Normally this kind of thing (eg root=) comes via the command
  line.  So maybe the bug is in the initrd generation.

  Alternatively, maybe xen-create-image should write a setting "extra"
  into the domain config file "resume=<swap device>".


Ian.
Wei Liu Oct. 20, 2017, 3:36 p.m. UTC | #10
On Fri, Oct 20, 2017 at 04:10:16PM +0100, Ian Jackson wrote:
> Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > On Fri, Oct 20, 2017 at 02:42:53PM +0100, Ian Jackson wrote:
> > > Wei Liu writes ("Re: [OSSTEST PATCH 11/16] ts-debian-fixup: use correct resume device"):
> > > I'm not sure what you mean here.  We are using the config template
> > > provided by xen-tools.  The reuse of the host's initrd is suggested by
> > > the xen-create-image documentation.
> > 
> > TBH I don't know what to write in a report. What is the expected
> > behaviour?
> 
> How about this ?  Adjust to taste, correcting any
> lies/misunderstandings, maybe adding references, etc.

Actually, there is already a Debian bug. See #784810.
diff mbox

Patch

diff --git a/ts-debian-fixup b/ts-debian-fixup
index d476467..fe37c30 100755
--- a/ts-debian-fixup
+++ b/ts-debian-fixup
@@ -176,6 +176,18 @@  sub otherfixupcfg () {
     }
 
 
+    # There might be stale entries in /etc/initramfs-tools/conf.d/resume
+    # which get stored in the initramfs. That introduces delay in guest booting
+    # which might cause tests to fail.
+    #
+    # This is particularly prominent in systemd enabled releases when it tries
+    # to scan for the inexistent device(s) for a long time.
+    #
+    # Override that in kernel command line with the correct swap partition.
+    $cfg =~ m/'phy:.+-swap,(xvda\d+),.*'/;
+    $extra .= " resume=/dev/$1";
+    logm("change resume device to $1");
+
     $cfg =~ s/^extra.*//mg;
     $cfg .= "\nextra='$extra'\n";
 };