debian stretch dom0 + xen 4.9 fails to boot

Message ID	593E8BC70200007800161DEB@prv-mh.provo.novell.com (mailing list archive)
State	New, archived
From: "Jan Beulich" <JBeulich@suse.com>
To: "Paul Durrant" <Paul.Durrant@citrix.com>
Cc: Juergen Gross <jgross@suse.com>, Andrew Cooper <Andrew.Cooper3@citrix.com>, "Julien Grall (julien.grall@arm.com)" <julien.grall@arm.com>, 'Boris Ostrovsky' <boris.ostrovsky@oracle.com>, "xen-devel(xen-devel@lists.xenproject.org)" <xen-devel@lists.xenproject.org>
Date: Mon, 12 Jun 2017 04:40:39 -0600
Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
In-Reply-To: <86a3251e9ac44a2bb2df23862e458ee0@AMSPEX02CL03.citrite.net>

Jan Beulich June 12, 2017, 10:40 a.m. UTC

>>> On 12.06.17 at 10:14, <Paul.Durrant@citrix.com> wrote:
> Looking at the code in arch/x86/boot/edd.c in Linux, it sector aligns the 
> buffer into which it reads the MBR and the sector size is pulled from the EDD 
> which means, I believe, that the MBR read on the skull canyon would be 4k 
> aligned.
> 
> What do you think it best to do for Xen 4.9? Hardcoding a 4k alignment is 
> clearly easy and would work around this BIOS issue but, as you say, it does 
> grow the image. Reverting Juergen's patch also works round the issue, but 
> that is more by luck. Re-working the code is preferable, but I guess it's too 
> late to introduce such code-churn in 4.9.

Reverting Jürgen's code is out of the question with all the information
you've gathered by now. I think re-working the EDD code slightly
is the best option. Would you mind giving the attached patch a
try? This still slightly grows the trampoline due to a few more
instructions being needed, but should still be far better than
embedding a whole 4k buffer (and then later finding a BIOS/disk
combination which wants even more). Note that I've left a tiny
bit of debugging code in there.

Jan
TODO: remove //temp-s

We place the trampoline no lower than at 256k, so we have ample space
to read the MBRs of BIOS disks into an aligned buffer right below the
trampoline (not doing so has been found to be a problem on a buggy BIOS
coming with a Skull Canyon NUC). To facilitate that, move MBR reading
past EDD info retrieval.

Also add a wrap check to the EDD info retrieval loop, to match that in
the MBR reading one.

Reported-by: Paul Durrant <Paul.Durrant@citrix.com>
---
Using 512-byte sector size as default right now - perhaps worth
considering using 4k instead. I'm also not sure whether we shouldn't
sanity check the sector size some more.
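
The buffer placement described in the patch notes can be sketched in a few lines. This is an illustration only, not the attached patch: the function names are hypothetical, the trampoline base of 0x86000 is the value logged further down in this thread, and reading the logged 85e0 as a real-mode segment (physical address shifted right by 4) is an assumption.

```python
def mbr_buffer_below(trampoline_base: int, sector_size: int = 512) -> int:
    """Return a sector-aligned address for a one-sector buffer that ends
    at or below trampoline_base (hypothetical helper, not Xen code)."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a power of two")
    # Step down by one sector, then round down to a sector boundary.
    return (trampoline_base - sector_size) & ~(sector_size - 1)


def real_mode_segment(phys_addr: int) -> int:
    """Convert a 16-byte-aligned physical address to a real-mode segment."""
    if phys_addr & 0xF:
        raise ValueError("address is not paragraph (16-byte) aligned")
    return phys_addr >> 4


if __name__ == "__main__":
    base = 0x86000  # assumed trampoline base, taken from the log in this thread
    for size in (512, 4096):
        buf = mbr_buffer_below(base, size)
        print(f"sector {size}: buffer {buf:#x}, segment {real_mode_segment(buf):#x}")
```

With a 512-byte sector the buffer sits one sector below the trampoline; with a 4k sector the gap between the two values would be 0x1000.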

Paul Durrant June 12, 2017, 10:44 a.m. UTC | #1

Sure, I'll give that a go now.

  Paul

Paul Durrant June 12, 2017, 10:53 a.m. UTC | #2

That worked fine:

(XEN) MBR[80] @ 85e0 (86000)

so you can add my Tested-by to that.

Thanks,

  Paul

Jan Beulich June 12, 2017, 11:12 a.m. UTC | #3

>>> On 12.06.17 at 12:53, <Paul.Durrant@citrix.com> wrote:
> That worked fine:
> 
> (XEN) MBR[80] @ 85e0 (86000)

But that's contrary to your earlier findings: Didn't you say simply
avoiding a 4k-boundary wasn't enough? And it certainly tells us
that this isn't a 4k drive (or at least the BIOS doesn't surface 4k
sectors) - I was really expecting a larger gap between the two
logged values.

> so you can add my Tested-by to that.

I.e. I'm not sure about this, as I'm still uncertain whether some
corruption didn't again occur. Of course APs coming up properly
would already be a relatively good sign (as now the permanent
part of the trampoline would be the predestined area for
corruption to occur in).

Jan

Paul Durrant June 12, 2017, 12:05 p.m. UTC | #4

> But that's contrary to your earlier findings: Didn't you say simply
> avoiding a 4k-boundary wasn't enough? And it certainly tells us
> that this isn't a 4k drive (or at least the BIOS doesn't surface 4k
> sectors) - I was really expecting a larger gap between the two
> logged values.


I'll go dump out the edd and double check what it is saying.

My findings indicated that a read spanning a 4k boundary seemed to be what caused the problem, so using 0x85e00 would be safe. The anomaly was that simply aligning the edd_info buffer on a 512-byte boundary and continuing to use that for reading did not work.
 
> > so you can add my Tested-by to that.
>
> I.e. I'm not sure about this, as I'm still uncertain whether some
> corruption didn't again occur. Of course APs coming up properly
> would already be a relatively good sign (as now the permanent
> part of the trampoline would be the predestined area for
> corruption to occur in).


None of my findings ever indicated memory corruption (although there, of course, may have been some that I happened to miss), but rather misbehaviour of the int13 handler itself - either locking up, having odd effects (e.g. black screen), or both.

  Paul

Paul Durrant June 12, 2017, 12:25 p.m. UTC | #5

> I'll go dump out the edd and double check what it is saying.

I dumped a bit of the info:

(XEN) device 0x80 version 0x30
(XEN) number_of_sectors = 0x1dcf32b0
(XEN) sectors_per_track = 0x3f
(XEN) bytes_per_sector = 0x200

So it is indeed advertising a 512 byte sector. It is an SSD though so it'll be something much bigger underneath.

  Paul
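
As a cross-check of the dump above, the reported geometry can be multiplied out. This is only a sketch: the helper names are hypothetical, and the 512..4096 range used in the sector-size check is an assumption for illustration, not something the EDD interface or Xen prescribes.

```python
def edd_capacity_bytes(number_of_sectors: int, bytes_per_sector: int) -> int:
    """Capacity implied by the EDD fields (hypothetical helper)."""
    return number_of_sectors * bytes_per_sector


def plausible_sector_size(bytes_per_sector: int) -> bool:
    """Rough sanity check: a power of two within an assumed 512..4096 range."""
    is_power_of_two = bytes_per_sector > 0 and not (bytes_per_sector & (bytes_per_sector - 1))
    return is_power_of_two and 512 <= bytes_per_sector <= 4096


if __name__ == "__main__":
    sectors = 0x1dcf32b0  # number_of_sectors from the dump
    sector_size = 0x200   # bytes_per_sector from the dump
    capacity = edd_capacity_bytes(sectors, sector_size)
    print(f"{capacity} bytes (~{capacity / 10**9:.0f} GB), "
          f"sector size plausible: {plausible_sector_size(sector_size)}")
```

The product comes to roughly 256 GB, which fits a drive presenting 512-byte logical sectors regardless of what it uses internally.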

Jan Beulich June 12, 2017, 1:54 p.m. UTC | #6

>>> On 12.06.17 at 14:05, <Paul.Durrant@citrix.com> wrote:
> My findings indicated that a read spanning a 4k boundary seemed to be
> what caused the problem, so using 0x85e00 would be safe. The anomaly
> was that simply aligning the edd_info buffer on a 512-byte boundary
> and continuing to use that for reading did not work.

But a 512-byte aligned 512-byte buffer can't possibly cross a page
boundary.

> None of my findings ever indicated memory corruption (although there, of 
> course, may have been some that I happened to miss), but rather misbehaviour 
> of the int13 handler itself - either locking up, having odd effects (e.g. 
> black screen), or both.

Ah, I hadn't understood it this way so far; I had instead assumed
that the handler did return, but corrupted our trampoline area in
one way or another.

Jan

Paul Durrant June 12, 2017, 2:28 p.m. UTC | #7

> But a 512-byte aligned 512-byte buffer can't possibly cross a page
> boundary.


Indeed, which is why I was perplexed. I found that 0x60e00 was ok. Your patch chose 0x85e00, which was ok too, but for some reason a '.align 512' in front of boot_edd_info yielded an address which was not ok. I just checked what address that yielded though (by booting with edd=off to avoid the hang) and it was 0x86f40... which clearly means that '.align 512' is not doing what I thought it would do.

  Paul
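
The three addresses mentioned in this message can be checked against both conditions under discussion: 512-byte alignment and whether a 512-byte read starting there spans a 4k boundary. This is a sketch with hypothetical helper names; it only does the address arithmetic and says nothing about why the BIOS misbehaves.

```python
PAGE_SIZE = 4096
SECTOR_SIZE = 512


def is_aligned(addr: int, alignment: int) -> bool:
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0


def crosses_page(addr: int, length: int, page_size: int = PAGE_SIZE) -> bool:
    """True if the byte range [addr, addr + length) spans a page boundary."""
    return addr // page_size != (addr + length - 1) // page_size


if __name__ == "__main__":
    for addr in (0x60e00, 0x85e00, 0x86f40):
        print(f"{addr:#x}: 512-aligned={is_aligned(addr, SECTOR_SIZE)}, "
              f"crosses 4k={crosses_page(addr, SECTOR_SIZE)}")
```

Of the three, only 0x86f40 fails the alignment check, and a 512-byte read from it does span a 4k boundary, which fits the earlier observation about reads that cross one.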

Paul Durrant June 12, 2017, 2:43 p.m. UTC | #8

> Indeed, which is why I was perplexed. I found that 0x60e00 was ok. Your
> patch chose 0x85e00, which was ok too, but for some reason a '.align 512' in
> front of boot_edd_info yielded an address which was not ok. I just checked
> what address that yielded though (by booting with edd=off to avoid the
> hang) and it was 0x86f40... which clearly means that '.align 512' is not doing
> what I thought it would do.


No, the problem turns out to be the GLOBAL() macro which, in assembly files, contains an implicit .align 16!

  Paul

Paul Durrant June 12, 2017, 3:03 p.m. UTC | #9

> -----Original Message-----

> From: Paul Durrant

> Sent: 12 June 2017 15:43

> To: Paul Durrant <Paul.Durrant@citrix.com>; 'Jan Beulich'

> <JBeulich@suse.com>

> Cc: Juergen Gross <jgross@suse.com>; Andrew Cooper

> <Andrew.Cooper3@citrix.com>; Julien Grall (julien.grall@arm.com)

> <julien.grall@arm.com>; 'Boris Ostrovsky' <boris.ostrovsky@oracle.com>;

> xen-devel(xen-devel@lists.xenproject.org) <xen-

> devel@lists.xenproject.org>

> Subject: RE: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot

> 

> > -----Original Message-----

> > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of

> > Paul Durrant

> > Sent: 12 June 2017 15:29

> > To: 'Jan Beulich' <JBeulich@suse.com>

> > Cc: Juergen Gross <jgross@suse.com>; Andrew Cooper

> > <Andrew.Cooper3@citrix.com>; Julien Grall (julien.grall@arm.com)

> > <julien.grall@arm.com>; 'Boris Ostrovsky' <boris.ostrovsky@oracle.com>;

> > xen-devel(xen-devel@lists.xenproject.org) <xen-

> > devel@lists.xenproject.org>

> > Subject: Re: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot

> >

> > > -----Original Message-----

> > > From: Jan Beulich [mailto:JBeulich@suse.com]

> > > Sent: 12 June 2017 14:55

> > > To: Paul Durrant <Paul.Durrant@citrix.com>

> > > Cc: Julien Grall (julien.grall@arm.com) <julien.grall@arm.com>; Andrew

> > > Cooper <Andrew.Cooper3@citrix.com>; xen-devel(xen-

> > > devel@lists.xenproject.org) <xen-devel@lists.xenproject.org>; 'Boris

> > > Ostrovsky' <boris.ostrovsky@oracle.com>; Juergen Gross

> > > <jgross@suse.com>

> > > Subject: RE: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot

> > >

> > > >>> On 12.06.17 at 14:05, <Paul.Durrant@citrix.com> wrote:

> > > >>  -----Original Message-----

> > > >> From: Jan Beulich [mailto:JBeulich@suse.com]

> > > >> Sent: 12 June 2017 12:12

> > > >> To: Paul Durrant <Paul.Durrant@citrix.com>

> > > >> Cc: Julien Grall (julien.grall@arm.com) <julien.grall@arm.com>;

> Andrew

> > > >> Cooper <Andrew.Cooper3@citrix.com>; xen-devel(xen-

> > > >> devel@lists.xenproject.org) <xen-devel@lists.xenproject.org>; 'Boris

> > > >> Ostrovsky' <boris.ostrovsky@oracle.com>; Juergen Gross

> > > >> <jgross@suse.com>

> > > >> Subject: RE: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot

> > > >>

> > > >> >>> On 12.06.17 at 12:53, <Paul.Durrant@citrix.com> wrote:

> > > >> >>  -----Original Message-----

> > > >> > [snip]

> > > >> >> > >
> > > >> >> > > What do you think it best to do for Xen 4.9? Hardcoding a 4k
> > > >> >> > > alignment is clearly easy and would work around this BIOS issue
> > > >> >> > > but, as you say, it does grow the image. Reverting Juergen's
> > > >> >> > > patch also works round the issue, but that is more by luck.
> > > >> >> > > Re-working the code is preferable, but I guess it's too late to
> > > >> >> > > introduce such code-churn in 4.9.
> > > >> >> >
> > > >> >> > Reverting Jürgen's code is out of question with all the information
> > > >> >> > you've gathered by now. I think re-working the EDD code slightly
> > > >> >> > is the best option. Would you mind giving the attached patch a
> > > >> >> > try? This still slightly grows the trampoline due to a few more
> > > >> >> > instructions being needed, but should still be far better than
> > > >> >> > embedding a whole 4k buffer (and then later finding a BIOS/disk
> > > >> >> > combination which wants even more). Note that I've left a tiny
> > > >> >> > bit of debugging code in there.
> > > >> >> >
> > > >> >>
> > > >> >> Sure, I'll give that a go now.
> > > >> >>
> > > >> >
> > > >> > That worked fine:
> > > >> >
> > > >> > (XEN) MBR[80] @ 85e0 (86000)
> > > >>
> > > >> But that's contrary to your earlier findings: Didn't you say simply
> > > >> avoiding a 4k-boundary wasn't enough? And it certainly tells us
> > > >> that this isn't a 4k drive (or at least the BIOS doesn't surface 4k
> > > >> sectors) - I was really expecting a larger gap between the two
> > > >> logged values.
> > > >>
> > > >
> > > > I'll go dump out the edd and double check what it is saying.
> > > >
> > > > My findings indicated that the problem seemed to be doing a read that
> > > > spanned a 4k boundary caused a problem, so using 0x85e00 would be
> > > > safe. The anomaly was that simply aligning the edd_info buffer and a
> > > > 512 byte boundary and continuing to use that for reading did not work.
> > >
> > > But a 512-byte aligned 512-byte buffer can't possibly cross a page
> > > boundary.
> >
> > Indeed, which is why I was perplexed. I found that 0x60e00 was ok. Your
> > patch chose 0x85e00, which was ok too, but for some reason a '.align 512' in
> > front of boot_edd_info yielded an address which was not ok. I just checked
> > what address that yielded though (by booting with edd=off to avoid the
> > hang) and it was 0x86f40... which clearly means that '.align 512' is not doing
> > what I thought it would do.
>
> No, the problem turns out to be the GLOBAL() macro which, in assembly
> files, contains an implicit .align 16!
>

No, I misread: ENTRY() contains the implicit align.

It's clearly even more subtle. Running objdump tells me the symbol is indeed 512-byte aligned, but when it ends up in memory it's clearly not. So I guess it must be down to how the trampoline is loaded. Thus, not using a buffer within the trampoline image is most definitely the best idea.

  Paul
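
The addresses Paul reports can be checked with a little arithmetic. The sketch below is only an illustration, assuming 512-byte sector reads and 4 KiB pages; the helper names are made up for this example and are not taken from the Xen sources.

```python
SECTOR_SIZE = 512   # assumed size of one MBR read
PAGE_SIZE = 4096    # 4 KiB page

def is_aligned(addr, alignment):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0

def crosses_page(addr, length=SECTOR_SIZE, page=PAGE_SIZE):
    """True if the buffer [addr, addr + length) spans a page boundary."""
    return addr // page != (addr + length - 1) // page

# Buffer addresses quoted in the thread.
for addr in (0x60e00, 0x85e00, 0x86f40):
    print(hex(addr),
          "512-aligned:", is_aligned(addr, SECTOR_SIZE),
          "crosses 4k:", crosses_page(addr))
```

Under these assumptions only 0x86f40 is misaligned, and a 512-byte read starting there runs to 0x8713f, crossing the 4k boundary at 0x87000. That fits the earlier observation that reads spanning a 4k boundary were the ones that misbehaved.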

>   Paul
>
> >
> >   Paul
> >
> > >
> > > >> > so you can add my Tested-by to that.
> > > >>
> > > >> I.e. I'm not sure about this, as I'm still uncertain whether some
> > > >> corruption didn't again occur. Of course APs coming up properly
> > > >> would already be a relatively good sign (as now the permanent
> > > >> part of the trampoline would be the predestined area for
> > > >> corruption to occur in).
> > > >>
> > > >
> > > > None of my findings ever indicated memory corruption (although there,
> > > > of course, may have been some that I happened to miss), but rather
> > > > misbehaviour of the int13 handler itself - either locking up, having
> > > > odd effects (e.g. black screen), or both.
> > >
> > > Ah, I didn't understand it this way so far, and instead had implied
> > > that the handler did return, but corrupt our trampoline area in
> > > one way or another.
> > >
> > > Jan


Jan Beulich June 12, 2017, 3:07 p.m. UTC | #10

>>> On 12.06.17 at 16:43, <Paul.Durrant@citrix.com> wrote:
>> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
>> Paul Durrant
>> Sent: 12 June 2017 15:29
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> > Sent: 12 June 2017 14:55
>> > >> > That worked fine:
>> > >> >
>> > >> > (XEN) MBR[80] @ 85e0 (86000)
>> > >>
>> > >> But that's contrary to your earlier findings: Didn't you say simply
>> > >> avoiding a 4k-boundary wasn't enough? And it certainly tells us
>> > >> that this isn't a 4k drive (or at least the BIOS doesn't surface 4k
>> > >> sectors) - I was really expecting a larger gap between the two
>> > >> logged values.
>> > >>
>> > >
>> > > I'll go dump out the edd and double check what it is saying.
>> > >
>> > > My findings indicated that the problem seemed to be doing a read that
>> > > spanned a 4k boundary caused a problem, so using 0x85e00 would be
>> > > safe. The anomaly was that simply aligning the edd_info buffer and a
>> > > 512 byte boundary and continuing to use that for reading did not work.
>> >
>> > But a 512-byte aligned 512-byte buffer can't possibly cross a page
>> > boundary.
>> 
>> Indeed, which is why I was perplexed. I found that 0x60e00 was ok. Your
>> patch chose 0x85e00, which was ok too, but for some reason a '.align 512' in
>> front of boot_edd_info yielded an address which was not ok. I just checked
>> what address that yielded though (by booting with edd=off to avoid the
>> hang) and it was 0x86f40... which clearly means that '.align 512' is not doing
>> what I thought it would do.
> 
> No, the problem turns out to be the GLOBAL() macro which, in assembly files, 
> contains an implicit .align 16!

No, I don't think so - two successive .align don't have any bad effect,
the higher value will be it. Instead I think you're suffering from the
copying of the trampoline space to low memory: What is aligned to a
512-byte boundary in the image won't necessarily be in low memory,
unless trampoline_start is also aligned at least as much.

But with this likely having been the problem in your experiments I'm
not feeling sufficiently reassured to submit the patch "officially".

Jan
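
Jan's two points can be illustrated with a small model. It is a sketch, not Xen code: align_up() mimics what an assembler's .align does to the location counter, and relocated() models copying the trampoline image to a base address in low memory. The counter, offset and base values are hypothetical, chosen only so that the result matches the 0x86f40 address reported earlier in the thread.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)

def relocated(base, offset):
    """Address of an image offset once the image is copied to base."""
    return base + offset

# Two successive alignments: the larger one wins.
counter = 0x0d3a                        # hypothetical location counter
after_512 = align_up(counter, 512)
assert align_up(after_512, 16) == after_512

# Hypothetical buffer offset within the image, 512-byte aligned there.
offset = 0x0e00
assert offset % 512 == 0

# Base aligned only to 16 bytes: the copy loses the 512-byte alignment.
unaligned_base = 0x86140
print(hex(relocated(unaligned_base, offset)))   # 0x86f40

# Base aligned to 512 bytes (or more): the alignment survives the copy.
aligned_base = 0x86000
print(hex(relocated(aligned_base, offset)))     # 0x86e00
```

In this model the alignment of a symbol inside the image only carries over to its final address when the load base is aligned at least as strictly, which is the condition Jan states for trampoline_start.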


Paul Durrant June 12, 2017, 3:21 p.m. UTC | #11

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 12 June 2017 16:08
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: Julien Grall (julien.grall@arm.com) <julien.grall@arm.com>; Andrew
> Cooper <Andrew.Cooper3@citrix.com>;
> xen-devel(xen-devel@lists.xenproject.org) <xen-devel@lists.xenproject.org>;
> 'Boris Ostrovsky' <boris.ostrovsky@oracle.com>; Juergen Gross
> <jgross@suse.com>
> Subject: RE: [Xen-devel] debian stretch dom0 + xen 4.9 fails to boot
> 
> >>> On 12.06.17 at 16:43, <Paul.Durrant@citrix.com> wrote:
> >> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> >> Paul Durrant
> >> Sent: 12 June 2017 15:29
> >> > From: Jan Beulich [mailto:JBeulich@suse.com]
> >> > Sent: 12 June 2017 14:55
> >> > >> > That worked fine:
> >> > >> >
> >> > >> > (XEN) MBR[80] @ 85e0 (86000)
> >> > >>
> >> > >> But that's contrary to your earlier findings: Didn't you say simply
> >> > >> avoiding a 4k-boundary wasn't enough? And it certainly tells us
> >> > >> that this isn't a 4k drive (or at least the BIOS doesn't surface 4k
> >> > >> sectors) - I was really expecting a larger gap between the two
> >> > >> logged values.
> >> > >>
> >> > >
> >> > > I'll go dump out the edd and double check what it is saying.
> >> > >
> >> > > My findings indicated that the problem seemed to be doing a read
> >> > > that spanned a 4k boundary caused a problem, so using 0x85e00 would
> >> > > be safe. The anomaly was that simply aligning the edd_info buffer
> >> > > and a 512 byte boundary and continuing to use that for reading did
> >> > > not work.
> >> >
> >> > But a 512-byte aligned 512-byte buffer can't possibly cross a page
> >> > boundary.
> >>
> >> Indeed, which is why I was perplexed. I found that 0x60e00 was ok. Your
> >> patch chose 0x85e00, which was ok too, but for some reason a '.align 512'
> >> in front of boot_edd_info yielded an address which was not ok. I just
> >> checked what address that yielded though (by booting with edd=off to
> >> avoid the hang) and it was 0x86f40... which clearly means that
> >> '.align 512' is not doing what I thought it would do.
> >
> > No, the problem turns out to be the GLOBAL() macro which, in assembly
> > files, contains an implicit .align 16!
> 
> No, I don't think so - two successive .align don't have any bad effect,
> the higher value will be it. Instead I think you're suffering from the
> copying of the trampoline space to low memory: What is aligned to a
> 512-byte boundary in the image won't necessarily be in low memory,
> unless trampoline_start is also aligned at least as much.
> 
> But with this likely having been the problem in your experiments I'm
> not feeling sufficiently reassured to submit the patch "officially".
> 

I see you submitted the patch.

I'm happy now because the anomaly in what I was seeing is explained. I was convinced that, at some stage, I had found that the image was 64k aligned in memory. I was clearly wrong.

  Paul

> Jan
