blkif: reconcile protocol specification with in-use implementations

Message ID 20240903141923.72241-1-roger.pau@citrix.com (mailing list archive)
State Superseded

Commit Message

Roger Pau Monne Sept. 3, 2024, 2:19 p.m. UTC
Current blkif implementations (both backends and frontends) all differ slightly
in how they handle the 'sector-size' xenstore node, and in how other fields are
derived from this value or hardcoded to be expressed in units of 512 bytes.

To give some context, this is an excerpt of how different implementations use
the value in 'sector-size' as the base unit for other fields rather than
just to set the logical sector size of the block device:

                        │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Linux blk{front,back}   │         512         │          512           │           512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
QEMU blkback            │     sector-size     │      sector-size       │       sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Windows blkfront        │     sector-size     │      sector-size       │       sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
MiniOS                  │     sector-size     │          512           │           512

An attempt was made by 67e1c050e36b to change the base units of the
request fields and the xenstore 'sectors' node.  That however only led to more
confusion, as the specification now clearly diverged from the reference
implementation in Linux.  Such change was only implemented for QEMU Qdisk
and Windows PV blkfront.

Partially revert to the state before 67e1c050e36b:

 * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
   the node, backends should not make decisions based on its presence.

 * Clarify that 'sectors' xenstore node and the requests fields are always in
   512-byte units, like it was previous to 67e1c050e36b.

All base units for the fields used in the protocol are 512-byte based; the
xenbus 'sector-size' field is only used to signal the logical block size.  When
'sector-size' is greater than 512, blkfront implementations must make sure that
the offsets and sizes (even when expressed in 512-byte units) are aligned to
the logical block size specified in 'sector-size', otherwise the backend will
fail to process the requests.

This will require changes to some of the frontends and backends in order to
properly support 'sector-size' nodes greater than 512.

Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/include/public/io/blkif.h | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

Comments

Jan Beulich Sept. 3, 2024, 2:36 p.m. UTC | #1
On 03.09.2024 16:19, Roger Pau Monne wrote:
> Current blkif implementations (both backends and frontends) have all slight
> differences about how they handle the 'sector-size' xenstore node, and how
> other fields are derived from this value or hardcoded to be expressed in units
> of 512 bytes.
> 
> To give some context, this is an excerpt of how different implementations use
> the value in 'sector-size' as the base unit for to other fields rather than
> just to set the logical sector size of the block device:
> 
>                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> Linux blk{front,back}   │         512         │          512           │           512
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> QEMU blkback            │     sector-size     │      sector-size       │       sector-size
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> Windows blkfront        │     sector-size     │      sector-size       │       sector-size
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> MiniOS                  │     sector-size     │          512           │           512
> 
> An attempt was made by 67e1c050e36b in order to change the base units of the
> request fields and the xenstore 'sectors' node.  That however only lead to more
> confusion, as the specification now clearly diverged from the reference
> implementation in Linux.  Such change was only implemented for QEMU Qdisk
> and Windows PV blkfront.
> 
> Partially revert to the state before 67e1c050e36b:
> 
>  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
>    the node, backends should not make decisions based on its presence.
> 
>  * Clarify that 'sectors' xenstore node and the requests fields are always in
>    512-byte units, like it was previous to 67e1c050e36b.
> 
> All base units for the fields used in the protocol are 512-byte based, the
> xenbus 'sector-size' field is only used to signal the logic block size.  When
> 'sector-size' is greater than 512, blkfront implementations must make sure that
> the offsets and sizes (even when expressed in 512-byte units) are aligned to
> the logical block size specified in 'sector-size', otherwise the backend will
> fail to process the requests.
> 
> This will require changes to some of the frontends and backends in order to
> properly support 'sector-size' nodes greater than 512.
> 
> Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Following the earlier discussion, I was kind of hoping that there would be
at least an outline of some plan here as to (efficiently) dealing with 4k-
sector disks. In the absence of that I'm afraid it is a little harder to
judge whether the proposal here is the best we can do at this point.

Jan

> --- a/xen/include/public/io/blkif.h
> +++ b/xen/include/public/io/blkif.h
> @@ -240,10 +240,6 @@
>   *      The logical block size, in bytes, of the underlying storage. This
>   *      must be a power of two with a minimum value of 512.
>   *
> - *      NOTE: Because of implementation bugs in some frontends this must be
> - *            set to 512, unless the frontend advertizes a non-zero value
> - *            in its "feature-large-sector-size" xenbus node. (See below).
> - *
>   * physical-sector-size
>   *      Values:         <uint32_t>
>   *      Default Value:  <"sector-size">
> @@ -254,8 +250,8 @@
>   * sectors
>   *      Values:         <uint64_t>
>   *
> - *      The size of the backend device, expressed in units of "sector-size".
> - *      The product of "sector-size" and "sectors" must also be an integer
> + *      The size of the backend device, expressed in units of 512b.
> + *      The product of "sectors" * 512 must also be an integer
>   *      multiple of "physical-sector-size", if that node is present.
>   *
>   *****************************************************************************
> @@ -338,6 +334,7 @@
>   * feature-large-sector-size
>   *      Values:         0/1 (boolean)
>   *      Default Value:  0
> + *      Notes:          DEPRECATED, 12
>   *
>   *      A value of "1" indicates that the frontend will correctly supply and
>   *      interpret all sector-based quantities in terms of the "sector-size"
> @@ -411,6 +408,11 @@
>   *(10) The discard-secure property may be present and will be set to 1 if the
>   *     backing device supports secure discard.
>   *(11) Only used by Linux and NetBSD.
> + *(12) Possibly only ever implemented by the QEMU Qdisk backend and the Windows
> + *     PV block frontend.  Other backends and frontends supported 'sector-size'
> + *     values greater than 512 before such feature was added.  Frontends should
> + *     not expose this node, neither should backends make any decisions based
> + *     on it being exposed by the frontend.
>   */
>  
>  /*
> @@ -621,9 +623,12 @@
>  /*
>   * NB. 'first_sect' and 'last_sect' in blkif_request_segment, as well as
>   * 'sector_number' in blkif_request, blkif_request_discard and
> - * blkif_request_indirect are sector-based quantities. See the description
> - * of the "feature-large-sector-size" frontend xenbus node above for
> - * more information.
> + * blkif_request_indirect are all in units of 512 bytes, regardless of whether the
> + * 'sector-size' xenstore node contains a value greater than 512.
> + *
> + * However the value in those fields must be properly aligned to the logical
> + * sector size reported by the 'sector-size' xenstore node, see 'Backend Device
> + * Properties' section.
>   */
>  struct blkif_request_segment {
>      grant_ref_t gref;        /* reference to I/O buffer frame        */

Roger Pau Monne Sept. 4, 2024, 8:21 a.m. UTC | #2
On Tue, Sep 03, 2024 at 04:36:37PM +0200, Jan Beulich wrote:
> On 03.09.2024 16:19, Roger Pau Monne wrote:
> > Current blkif implementations (both backends and frontends) have all slight
> > differences about how they handle the 'sector-size' xenstore node, and how
> > other fields are derived from this value or hardcoded to be expressed in units
> > of 512 bytes.
> > 
> > To give some context, this is an excerpt of how different implementations use
> > the value in 'sector-size' as the base unit for to other fields rather than
> > just to set the logical sector size of the block device:
> > 
> >                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > Linux blk{front,back}   │         512         │          512           │           512
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > QEMU blkback            │     sector-size     │      sector-size       │       sector-size
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > Windows blkfront        │     sector-size     │      sector-size       │       sector-size
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > MiniOS                  │     sector-size     │          512           │           512
> > 
> > An attempt was made by 67e1c050e36b in order to change the base units of the
> > request fields and the xenstore 'sectors' node.  That however only lead to more
> > confusion, as the specification now clearly diverged from the reference
> > implementation in Linux.  Such change was only implemented for QEMU Qdisk
> > and Windows PV blkfront.
> > 
> > Partially revert to the state before 67e1c050e36b:
> > 
> >  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
> >    the node, backends should not make decisions based on its presence.
> > 
> >  * Clarify that 'sectors' xenstore node and the requests fields are always in
> >    512-byte units, like it was previous to 67e1c050e36b.
> > 
> > All base units for the fields used in the protocol are 512-byte based, the
> > xenbus 'sector-size' field is only used to signal the logic block size.  When
> > 'sector-size' is greater than 512, blkfront implementations must make sure that
> > the offsets and sizes (even when expressed in 512-byte units) are aligned to
> > the logical block size specified in 'sector-size', otherwise the backend will
> > fail to process the requests.
> > 
> > This will require changes to some of the frontends and backends in order to
> > properly support 'sector-size' nodes greater than 512.
> > 
> > Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Following the earlier discussion, I was kind of hoping that there would be
> at least an outline of some plan here as to (efficiently) dealing with 4k-
> sector disks.

What do you mean by efficiently?

4K disks will set 'sector-size' to 4096, so the segments set up by the
frontends in the requests will all be 4K aligned (both address and
size).  Every segment will fill a full blkif_request_segment (making
the last_sect field kind of pointless).
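
To put numbers on that, here is a rough sketch (not taken from any of the
implementations; the helper and struct names are made up for illustration,
and a 4096-byte granted page is assumed) of how a byte range inside one
page maps onto first_sect/last_sect when both stay in 512-byte units:

```c
#include <assert.h>
#include <stdint.h>

#define BLKIF_UNIT 512u   /* fixed base unit of the ring fields */
#define PAGE_BYTES 4096u  /* assumed size of one granted page */

/* Hypothetical stand-in for the first_sect/last_sect pair of a segment. */
struct seg_range {
    uint8_t first_sect;  /* first 512-byte unit used in the page, inclusive */
    uint8_t last_sect;   /* last 512-byte unit used in the page, inclusive */
};

/*
 * Describe `len` bytes starting at byte offset `off` inside one page.
 * Both values must be non-zero multiples of 512 (len) or multiples of
 * 512 (off), and the range must not cross the end of the page.
 */
static struct seg_range seg_from_bytes(uint32_t off, uint32_t len)
{
    struct seg_range r;

    assert(len > 0 && off % BLKIF_UNIT == 0 && len % BLKIF_UNIT == 0);
    assert(off + len <= PAGE_BYTES);

    r.first_sect = (uint8_t)(off / BLKIF_UNIT);
    r.last_sect  = (uint8_t)((off + len) / BLKIF_UNIT - 1);
    return r;
}
```

With 'sector-size' set to 4096 the only range that satisfies the alignment
rule is the whole page, i.e. seg_from_bytes(0, 4096), which gives
first_sect 0 and last_sect 7 every time.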

> In the absence of that I'm afraid it is a little harder to
> judge whether the proposal here is the best we can do at this point.

While I don't mind looking at what we can do to better handle 4K
sector disks, we need IMO to revert to the specification before
67e1c050e36b, as that change switched the hardcoded sector based units
from 512 to 'sector-size', thus breaking the existing ABI.

Thanks, Roger.

Paul Durrant Sept. 4, 2024, 8:39 a.m. UTC | #3
On 04/09/2024 09:21, Roger Pau Monné wrote:
>> In the absence of that I'm afraid it is a little harder to
>> judge whether the proposal here is the best we can do at this point.
> 
> While I don't mind looking at what we can do to better handle 4K
> sector disks, we need IMO to revert to the specification before
> 67e1c050e36b, as that change switched the hardcoded sector based units
> from 512 to 'sector-size', thus breaking the existing ABI.
> 

But that's the crux of the problem. What *is* the ABI? We apparently
don't have one that all OSes subscribe to.
From your findings it appears that the only thing that will work
globally is to define 'sector-size' as strictly 512 and deprecate any
large sector size support of any kind.

   Paul

Roger Pau Monne Sept. 4, 2024, 9:11 a.m. UTC | #4
On Wed, Sep 04, 2024 at 09:39:17AM +0100, Paul Durrant wrote:
> On 04/09/2024 09:21, Roger Pau Monné wrote:
> > > In the absence of that I'm afraid it is a little harder to
> > > judge whether the proposal here is the best we can do at this point.
> > 
> > While I don't mind looking at what we can do to better handle 4K
> > sector disks, we need IMO to revert to the specification before
> > 67e1c050e36b, as that change switched the hardcoded sector based units
> > from 512 to 'sector-size', thus breaking the existing ABI.
> > 
> 
> But that's the crux of the problem. What *is* is the ABI? We apparently
> don't have one that all OS subscribe to.

At least prior to 67e1c050e36b the specification in blkif.h and (what
I consider) the reference implementation in Linux blk{front,back}
matched.  Previous to 67e1c050e36b blkif.h stated:

/*
 * NB. first_sect and last_sect in blkif_request_segment, as well as
 * sector_number in blkif_request, are always expressed in 512-byte units.
 * However they must be properly aligned to the real sector size of the
 * physical disk, which is reported in the "physical-sector-size" node in
 * the backend xenbus info. Also the xenbus "sectors" node is expressed in
 * 512-byte units.
 */

I think it was quite clear, and does in fact match the implementation
in Linux.
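
As a small sketch of what that comment asks of a frontend (the function
names are hypothetical and not from Linux or blkif.h): the guest block
layer counts in logical blocks of 'sector-size' bytes, and the value put
in sector_number is that count scaled to 512-byte units.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 512-byte units per logical block; sector_size comes from xenstore. */
static uint32_t units_per_block(uint32_t sector_size)
{
    /* The header requires a power of two with a minimum value of 512. */
    assert(sector_size >= 512 && (sector_size & (sector_size - 1)) == 0);
    return sector_size / 512u;
}

/* Logical block address (in 'sector-size' units) -> ring sector_number. */
static uint64_t lba_to_sector_number(uint64_t lba, uint32_t sector_size)
{
    return lba * units_per_block(sector_size);
}

/* Ring sector_number -> logical block address; only valid if aligned. */
static uint64_t sector_number_to_lba(uint64_t sector_number, uint32_t sector_size)
{
    assert(sector_number % units_per_block(sector_size) == 0);
    return sector_number / units_per_block(sector_size);
}
```

Because the result is always a multiple of sector-size / 512, values
produced this way are aligned to the logical block size by construction.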

> From your findings it appears that the only thing that will work globally is
> to define 'sector-size' is strictly 512 and deprecate any large sector size
> support of any kind.

As said to Anthony, how do we deal with disks with 4K logical sectors?
I'm not really up for implementing read-modify-write in blkback on
Linux, as it would be complex, slow, and likely prone to errors.

We could introduce a new feature (`logical-sector-size` or some such?)
to expose a sector size != 512, but we might as well just fix existing
implementations to match with the specification, as that's overall
fewer changes.

In-kernel Linux blk{front,back} have always worked fine with 4K sector
disks, and did match the specification prior to 67e1c050e36b.  Let's
clarify the specification as required and fix the remaining front and
backends to adhere to it.

Thanks, Roger.

Jan Beulich Sept. 4, 2024, 9:31 a.m. UTC | #5
On 04.09.2024 10:21, Roger Pau Monné wrote:
> On Tue, Sep 03, 2024 at 04:36:37PM +0200, Jan Beulich wrote:
>> On 03.09.2024 16:19, Roger Pau Monne wrote:
>>> Current blkif implementations (both backends and frontends) have all slight
>>> differences about how they handle the 'sector-size' xenstore node, and how
>>> other fields are derived from this value or hardcoded to be expressed in units
>>> of 512 bytes.
>>>
>>> To give some context, this is an excerpt of how different implementations use
>>> the value in 'sector-size' as the base unit for to other fields rather than
>>> just to set the logical sector size of the block device:
>>>
>>>                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
>>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
>>> FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
>>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
>>> Linux blk{front,back}   │         512         │          512           │           512
>>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
>>> QEMU blkback            │     sector-size     │      sector-size       │       sector-size
>>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
>>> Windows blkfront        │     sector-size     │      sector-size       │       sector-size
>>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
>>> MiniOS                  │     sector-size     │          512           │           512
>>>
>>> An attempt was made by 67e1c050e36b in order to change the base units of the
>>> request fields and the xenstore 'sectors' node.  That however only lead to more
>>> confusion, as the specification now clearly diverged from the reference
>>> implementation in Linux.  Such change was only implemented for QEMU Qdisk
>>> and Windows PV blkfront.
>>>
>>> Partially revert to the state before 67e1c050e36b:
>>>
>>>  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
>>>    the node, backends should not make decisions based on its presence.
>>>
>>>  * Clarify that 'sectors' xenstore node and the requests fields are always in
>>>    512-byte units, like it was previous to 67e1c050e36b.
>>>
>>> All base units for the fields used in the protocol are 512-byte based, the
>>> xenbus 'sector-size' field is only used to signal the logic block size.  When
>>> 'sector-size' is greater than 512, blkfront implementations must make sure that
>>> the offsets and sizes (even when expressed in 512-byte units) are aligned to
>>> the logical block size specified in 'sector-size', otherwise the backend will
>>> fail to process the requests.
>>>
>>> This will require changes to some of the frontends and backends in order to
>>> properly support 'sector-size' nodes greater than 512.
>>>
>>> Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
>>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
>>
>> Following the earlier discussion, I was kind of hoping that there would be
>> at least an outline of some plan here as to (efficiently) dealing with 4k-
>> sector disks.
> 
> What do you mean with efficiently?
> 
> 4K disks will set 'sector-size' to 4096, so the segments setup by the
> frontends in the requests will all be 4K aligned (both address and
> size).

Will they, despite granularity then being 512b?

Perhaps I misunderstood the proposal then, and you're retaining the
ability to have "sector-size" != 512, just that any I/O done is not
supposed to consider that setting. I guess I mis-read the 2nd to last
paragraph of the description; I'm sorry. "even when expressed in 512-
byte units" reads to me as if other units are permissible. Maybe it
was really meant to be "despite being expressed in 512-byte units"?

Jan

Anthony PERARD Sept. 4, 2024, 9:35 a.m. UTC | #6
On Wed, Sep 04, 2024 at 11:11:51AM +0200, Roger Pau Monné wrote:
> On Wed, Sep 04, 2024 at 09:39:17AM +0100, Paul Durrant wrote:
> > On 04/09/2024 09:21, Roger Pau Monné wrote:
> > > > In the absence of that I'm afraid it is a little harder to
> > > > judge whether the proposal here is the best we can do at this point.
> > > 
> > > While I don't mind looking at what we can do to better handle 4K
> > > sector disks, we need IMO to revert to the specification before
> > > 67e1c050e36b, as that change switched the hardcoded sector based units
> > > from 512 to 'sector-size', thus breaking the existing ABI.
> > > 
> > 
> > But that's the crux of the problem. What *is* is the ABI? We apparently
> > don't have one that all OS subscribe to.
> 
> At least prior to 67e1c050e36b the specification in blkif.h and (what
> I consider) the reference implementation in Linux blk{front,back}
> matched.  Previous to 67e1c050e36b blkif.h stated:
> 
> /*
>  * NB. first_sect and last_sect in blkif_request_segment, as well as
>  * sector_number in blkif_request, are always expressed in 512-byte units.
>  * However they must be properly aligned to the real sector size of the
>  * physical disk, which is reported in the "physical-sector-size" node in
>  * the backend xenbus info. Also the xenbus "sectors" node is expressed in
>  * 512-byte units.
>  */
> 
> I think it was quite clear, and does in fact match the implementation
> in Linux.

That's wrong, Linux doesn't match the specification before 67e1c050e36b,
in particular for "sectors":

    sectors
         Values:         <uint64_t>

         The size of the backend device, expressed in units of its logical
         sector size ("sector-size").

The only implementation that matches this specification is MiniOS (and
OVMF).

Oh, I didn't notice that the random comment you quoted, which comes from
the middle of the header, has a different definition for "sectors" ...

Well, the specification doesn't match with the specification ... and the
only possible way to implement the specification is to only ever set
"sector-size" to 512...
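
To spell that out, a quick sketch (helper names are made up for
illustration) of the two readings of "sectors" side by side. They give the
same device size only when "sector-size" is 512:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reading 1: "sectors" counts fixed 512-byte units (the mid-header comment). */
static uint64_t capacity_fixed_512(uint64_t sectors)
{
    return sectors * 512u;
}

/* Reading 2: "sectors" counts units of "sector-size" (the node description). */
static uint64_t capacity_sector_size(uint64_t sectors, uint32_t sector_size)
{
    return sectors * (uint64_t)sector_size;
}

/* True when a frontend and a backend using different readings still agree. */
static bool readings_agree(uint64_t sectors, uint32_t sector_size)
{
    return capacity_fixed_512(sectors) ==
           capacity_sector_size(sectors, sector_size);
}
```

For a non-empty device with "sector-size" 4096 the second reading is 8
times larger than the first, so the two sides disagree about where the
disk ends.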

No wonder that there are so many different interpretations of the
protocol.

Cheers,

Roger Pau Monne Sept. 4, 2024, 10:07 a.m. UTC | #7
On Wed, Sep 04, 2024 at 11:31:08AM +0200, Jan Beulich wrote:
> On 04.09.2024 10:21, Roger Pau Monné wrote:
> > On Tue, Sep 03, 2024 at 04:36:37PM +0200, Jan Beulich wrote:
> >> On 03.09.2024 16:19, Roger Pau Monne wrote:
> >>> Current blkif implementations (both backends and frontends) have all slight
> >>> differences about how they handle the 'sector-size' xenstore node, and how
> >>> other fields are derived from this value or hardcoded to be expressed in units
> >>> of 512 bytes.
> >>>
> >>> To give some context, this is an excerpt of how different implementations use
> >>> the value in 'sector-size' as the base unit for to other fields rather than
> >>> just to set the logical sector size of the block device:
> >>>
> >>>                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> >>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> >>> FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
> >>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> >>> Linux blk{front,back}   │         512         │          512           │           512
> >>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> >>> QEMU blkback            │     sector-size     │      sector-size       │       sector-size
> >>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> >>> Windows blkfront        │     sector-size     │      sector-size       │       sector-size
> >>> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> >>> MiniOS                  │     sector-size     │          512           │           512
> >>>
> >>> An attempt was made by 67e1c050e36b in order to change the base units of the
> >>> request fields and the xenstore 'sectors' node.  That however only lead to more
> >>> confusion, as the specification now clearly diverged from the reference
> >>> implementation in Linux.  Such change was only implemented for QEMU Qdisk
> >>> and Windows PV blkfront.
> >>>
> >>> Partially revert to the state before 67e1c050e36b:
> >>>
> >>>  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
> >>>    the node, backends should not make decisions based on its presence.
> >>>
> >>>  * Clarify that 'sectors' xenstore node and the requests fields are always in
> >>>    512-byte units, like it was previous to 67e1c050e36b.
> >>>
> >>> All base units for the fields used in the protocol are 512-byte based, the
> >>> xenbus 'sector-size' field is only used to signal the logic block size.  When
> >>> 'sector-size' is greater than 512, blkfront implementations must make sure that
> >>> the offsets and sizes (even when expressed in 512-byte units) are aligned to
> >>> the logical block size specified in 'sector-size', otherwise the backend will
> >>> fail to process the requests.
> >>>
> >>> This will require changes to some of the frontends and backends in order to
> >>> properly support 'sector-size' nodes greater than 512.
> >>>
> >>> Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
> >>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> >>
> >> Following the earlier discussion, I was kind of hoping that there would be
> >> at least an outline of some plan here as to (efficiently) dealing with 4k-
> >> sector disks.
> > 
> > What do you mean with efficiently?
> > 
> > 4K disks will set 'sector-size' to 4096, so the segments setup by the
> > frontends in the requests will all be 4K aligned (both address and
> > size).
> 
> Will they, despite granularity then being 512b?

The added text to blkif.h states:

"However the value in those fields must be properly aligned to the logical
sector size reported by the 'sector-size' xenstore node, see 'Backend Device
Properties' section."

'those fields' in the text above refers to the sector based offsets
and sizes in blkif_request & other ring structs.  So while the base
units of the fields are 512-byte based, the resulting offsets and
sizes should be aligned to the value in 'sector-size'.

> Perhaps I misunderstood the proposal then, and you're retaining the
> ability to have "sector-size" != 512, just that any I/O done is not
> supposed to consider that setting.

No, I/O is supposed to consider that setting; it's just that the base
unit in the ring structures will always be 512-byte based, regardless
of what 'sector-size' contains.
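
As a sketch of the kind of check a backend could apply under this reading
(a hypothetical helper, not the actual blkback code; it assumes both the
start and the length of a segment must be aligned):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * All three request values are in 512-byte units; sector_size is the value
 * of the 'sector-size' xenstore node in bytes.  Returns true if the request
 * start, the in-page start and the segment length are all multiples of the
 * logical block size.
 */
static bool request_is_aligned(uint64_t sector_number, uint8_t first_sect,
                               uint8_t last_sect, uint32_t sector_size)
{
    uint32_t ratio, count;

    /* 'sector-size' must be a power of two, minimum 512. */
    if (sector_size < 512 || (sector_size & (sector_size - 1)) != 0)
        return false;
    if (last_sect < first_sect)
        return false;

    ratio = sector_size / 512u;                  /* units per logical block */
    count = (uint32_t)last_sect - first_sect + 1; /* units in this segment */

    return sector_number % ratio == 0 &&
           first_sect % ratio == 0 &&
           count % ratio == 0;
}
```

With 'sector-size' 512 the ratio is 1 and every request passes; with 4096
only whole-page segments starting at a sector_number that is a multiple of
8 pass.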

> I guess I mis-read the 2nd to last
> paragraph of the description; I'm sorry. "even when expressed in 512-
> byte units" reads to me as if other units are permissible. Maybe it
> was really meant to be "despite being expressed in 512-byte units"?

Sure, I will adjust the commit message.

Thanks, Roger.

Roger Pau Monne Sept. 4, 2024, 10:31 a.m. UTC | #8
On Wed, Sep 04, 2024 at 09:35:40AM +0000, Anthony PERARD wrote:
> On Wed, Sep 04, 2024 at 11:11:51AM +0200, Roger Pau Monné wrote:
> > On Wed, Sep 04, 2024 at 09:39:17AM +0100, Paul Durrant wrote:
> > > On 04/09/2024 09:21, Roger Pau Monné wrote:
> > > > > In the absence of that I'm afraid it is a little harder to
> > > > > judge whether the proposal here is the best we can do at this point.
> > > >
> > > > While I don't mind looking at what we can do to better handle 4K
> > > > sector disks, we need IMO to revert to the specification before
> > > > 67e1c050e36b, as that change switched the hardcoded sector based units
> > > > from 512 to 'sector-size', thus breaking the existing ABI.
> > > >
> > >
> > > But that's the crux of the problem. What *is* is the ABI? We apparently
> > > don't have one that all OS subscribe to.
> >
> > At least prior to 67e1c050e36b the specification in blkif.h and (what
> > I consider) the reference implementation in Linux blk{front,back}
> > matched.  Previous to 67e1c050e36b blkif.h stated:
> >
> > /*
> >  * NB. first_sect and last_sect in blkif_request_segment, as well as
> >  * sector_number in blkif_request, are always expressed in 512-byte units.
> >  * However they must be properly aligned to the real sector size of the
> >  * physical disk, which is reported in the "physical-sector-size" node in
> >  * the backend xenbus info. Also the xenbus "sectors" node is expressed in
> >  * 512-byte units.
> >  */
> >
> > I think it was quite clear, and does in fact match the implementation
> > in Linux.
> 
> That's wrong, Linux doesn't match the specification before 67e1c050e36b,
> in particular for "sectors":
> 
>     sectors
>          Values:         <uint64_t>
> 
>          The size of the backend device, expressed in units of its logical
>          sector size ("sector-size").

This was a bug introduced in 2fa701e5346d.  The 'random' comment that
you mention, which notes that 'sectors' is unconditionally expressed in
512-byte units, was added way before, in d05ae13188231.  The improved
documentation added by 2fa701e5346d failed to correctly reflect the
units of the 'sectors' node.

> 
> The only implementation that matches this specification is MiniOS (and
> OMVF).
> 
> Oh, I didn't notice that that random comment you quoted that comes from
> the middle of the header have a different definition for "sectors" ...
> 
> Well, the specification doesn't match with the specification ... and the
> only possible way to implement the specification is to only ever set
> "sector-size" to 512...
> 
> No wonder that they are so many different interpretation of the
> protocol.

My opinion is that there was a bug introduced in the specification in
2fa701e5346d, and that bug was extended by 67e1c050e36b to even more
fields.

Implementations should be fixed to adhere to the specification as it
was pre 2fa701e5346d, because that works correctly with 'sector-size'
!= 512, and is the one implemented in Linux blkfront and blkback.

There's no need to make this more complicated than it is.  We
introduced bugs in blkif.h, and those need to be fixed.  It's sad that
those bugs propagated into implementations, or that bugs from
implementations propagated into blkif.h.

I don't see an option where we get to keep our current diverging
implementations and still support 4K logical sector disks without
specification and code changes.  We could introduce a new way to
signal 4K logical sector sizes, but as that will require modifications
to every frontend and backend we might as well just fix the existing
mess and modify the implementations as required.

Thanks, Roger.

Anthony PERARD Sept. 4, 2024, 1:25 p.m. UTC | #9
On Tue, Sep 03, 2024 at 04:19:23PM +0200, Roger Pau Monne wrote:
> Current blkif implementations (both backends and frontends) have all slight
> differences about how they handle the 'sector-size' xenstore node, and how
> other fields are derived from this value or hardcoded to be expressed in units
> of 512 bytes.
> 
> To give some context, this is an excerpt of how different implementations use
> the value in 'sector-size' as the base unit for other fields rather than
> just to set the logical sector size of the block device:
> 
>                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> Linux blk{front,back}   │         512         │          512           │           512
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> QEMU blkback            │     sector-size     │      sector-size       │       sector-size
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> Windows blkfront        │     sector-size     │      sector-size       │       sector-size
> ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> MiniOS                  │     sector-size     │          512           │           512
> 
> An attempt was made by 67e1c050e36b in order to change the base units of the
> request fields and the xenstore 'sectors' node.  That however only led to more
> confusion, as the specification now clearly diverged from the reference
> implementation in Linux.  Such change was only implemented for QEMU Qdisk
> and Windows PV blkfront.
> 
> Partially revert to the state before 67e1c050e36b:
> 
>  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
>    the node, backends should not make decisions based on its presence.
> 
>  * Clarify that 'sectors' xenstore node and the requests fields are always in
>    512-byte units, like it was previous to 67e1c050e36b.
> 
> All base units for the fields used in the protocol are 512-byte based, the
> xenbus 'sector-size' field is only used to signal the logical block size.  When
> 'sector-size' is greater than 512, blkfront implementations must make sure that
> the offsets and sizes (even when expressed in 512-byte units) are aligned to
> the logical block size specified in 'sector-size', otherwise the backend will
> fail to process the requests.
> 
> This will require changes to some of the frontends and backends in order to
> properly support 'sector-size' nodes greater than 512.
> 
> Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')

Probably want to add:
Fixes: 2fa701e5346d ("blkif.h: Provide more complete documentation of the blkif interface")

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
>  xen/include/public/io/blkif.h | 23 ++++++++++++++---------
>  1 file changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> index 22f1eef0c0ca..07708f4d08eb 100644
> --- a/xen/include/public/io/blkif.h
> +++ b/xen/include/public/io/blkif.h
> @@ -240,10 +240,6 @@
>   *      The logical block size, in bytes, of the underlying storage. This
>   *      must be a power of two with a minimum value of 512.

Should we add that "sector-size" is to be used only for request length?


> - *      NOTE: Because of implementation bugs in some frontends this must be
> - *            set to 512, unless the frontend advertizes a non-zero value
> - *            in its "feature-large-sector-size" xenbus node. (See below).
> - *
>   * physical-sector-size
>   *      Values:         <uint32_t>
>   *      Default Value:  <"sector-size">
> @@ -254,8 +250,8 @@
>   * sectors
>   *      Values:         <uint64_t>
>   *
> - *      The size of the backend device, expressed in units of "sector-size".
> - *      The product of "sector-size" and "sectors" must also be an integer
> + *      The size of the backend device, expressed in units of 512b.
> + *      The product of "sector-size" * 512 must also be an integer
>   *      multiple of "physical-sector-size", if that node is present.
>   *
>   *****************************************************************************
> @@ -338,6 +334,7 @@
>   * feature-large-sector-size
>   *      Values:         0/1 (boolean)
>   *      Default Value:  0
> + *      Notes:          DEPRECATED, 12
>   *
>   *      A value of "1" indicates that the frontend will correctly supply and

Could you remove "correctly" from this sentence? It's misleading.

>   *      interpret all sector-based quantities in terms of the "sector-size"
> @@ -411,6 +408,11 @@
>   *(10) The discard-secure property may be present and will be set to 1 if the
>   *     backing device supports secure discard.
>   *(11) Only used by Linux and NetBSD.
> + *(12) Possibly only ever implemented by the QEMU Qdisk backend and the Windows
> + *     PV block frontend.  Other backends and frontends supported 'sector-size'
> + *     values greater than 512 before such feature was added.  Frontends should
> + *     not expose this node, neither should backends make any decisions based
> + *     on it being exposed by the frontend.
>   */
>  
>  /*
> @@ -621,9 +623,12 @@
>  /*
>   * NB. 'first_sect' and 'last_sect' in blkif_request_segment, as well as
>   * 'sector_number' in blkif_request, blkif_request_discard and
> - * blkif_request_indirect are sector-based quantities. See the description
> - * of the "feature-large-sector-size" frontend xenbus node above for
> - * more information.
> + * blkif_request_indirect are all in units of 512 bytes, regardless of whether the
> + * 'sector-size' xenstore node contains a value greater than 512.
> + *
> + * However the value in those fields must be properly aligned to the logical
> + * sector size reported by the 'sector-size' xenstore node, see 'Backend Device
> + * Properties' section.
>   */
>  struct blkif_request_segment {

Textually (that is without reading it) this comment seems to only apply
to `struct blkif_request_segment`. There is another comment that
separates it from `struct blkif_request` (and it is far away from
blkif_request_discard and blkif_request_indirect). For `struct
blkif_request.sector_number`, the only comment is "start sector idx on
disk", but it's really hard to tell how to interpret it: it could be
interpreted as a "sector-size" quantity because that is the size of a
sector on the disk, the underlying storage.

So, I think we need to change the comment of
`blkif_request.sector_number`.


Another thing, there's a "type" `blkif_sector_t` defined at the beginning
of the file, would it be worth it to add a description to it?

Thanks,
Roger Pau Monne Sept. 10, 2024, 9:07 a.m. UTC | #10
On Wed, Sep 04, 2024 at 01:25:46PM +0000, Anthony PERARD wrote:
> On Tue, Sep 03, 2024 at 04:19:23PM +0200, Roger Pau Monne wrote:
> > Current blkif implementations (both backends and frontends) have all slight
> > differences about how they handle the 'sector-size' xenstore node, and how
> > other fields are derived from this value or hardcoded to be expressed in units
> > of 512 bytes.
> >
> > To give some context, this is an excerpt of how different implementations use
> > the value in 'sector-size' as the base unit for other fields rather than
> > just to set the logical sector size of the block device:
> >
> >                         │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > Linux blk{front,back}   │         512         │          512           │           512
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > QEMU blkback            │     sector-size     │      sector-size       │       sector-size
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > Windows blkfront        │     sector-size     │      sector-size       │       sector-size
> > ────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
> > MiniOS                  │     sector-size     │          512           │           512
> >
> > An attempt was made by 67e1c050e36b in order to change the base units of the
> > request fields and the xenstore 'sectors' node.  That however only led to more
> > confusion, as the specification now clearly diverged from the reference
> > implementation in Linux.  Such change was only implemented for QEMU Qdisk
> > and Windows PV blkfront.
> >
> > Partially revert to the state before 67e1c050e36b:
> >
> >  * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
> >    the node, backends should not make decisions based on its presence.
> >
> >  * Clarify that 'sectors' xenstore node and the requests fields are always in
> >    512-byte units, like it was previous to 67e1c050e36b.
> >
> > All base units for the fields used in the protocol are 512-byte based, the
> > xenbus 'sector-size' field is only used to signal the logical block size.  When
> > 'sector-size' is greater than 512, blkfront implementations must make sure that
> > the offsets and sizes (even when expressed in 512-byte units) are aligned to
> > the logical block size specified in 'sector-size', otherwise the backend will
> > fail to process the requests.
> >
> > This will require changes to some of the frontends and backends in order to
> > properly support 'sector-size' nodes greater than 512.
> >
> > Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
> 
> Probably want to add:
> Fixes: 2fa701e5346d ("blkif.h: Provide more complete documentation of the blkif interface")
> 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> >  xen/include/public/io/blkif.h | 23 ++++++++++++++---------
> >  1 file changed, 14 insertions(+), 9 deletions(-)
> >
> > diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
> > index 22f1eef0c0ca..07708f4d08eb 100644
> > --- a/xen/include/public/io/blkif.h
> > +++ b/xen/include/public/io/blkif.h
> > @@ -240,10 +240,6 @@
> >   *      The logical block size, in bytes, of the underlying storage. This
> >   *      must be a power of two with a minimum value of 512.
> 
> Should we add that "sector-size" is to be used only for request length?

Yes, that would be fine.

What about:

The logical block size, in bytes, of the underlying storage. This must
be a power of two with a minimum value of 512.  The sector size should
only be used for request segment length and alignment.
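
To illustrate what "length and alignment" could mean for a frontend,
here is a rough sketch.  It assumes 'sector_number', 'first_sect' and
'last_sect' are in 512-byte units, that 'first_sect'/'last_sect' are
inclusive indexes within a frame, and that both the start offset and the
segment length must be multiples of 'sector-size'.  The function names
are made up for the example and do not exist in blkif.h:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of 512-byte units per logical sector. 'sector_size' is assumed
 * to be a power of two >= 512.
 */
static uint32_t units_per_sector(uint32_t sector_size)
{
    return sector_size / 512u;
}

/*
 * Convert a logical block address (counted in 'sector-size' units) into
 * the 512-byte based value carried in a request's sector_number field.
 */
static uint64_t lba_to_sector_number(uint64_t lba, uint32_t sector_size)
{
    return lba * units_per_sector(sector_size);
}

/*
 * Check that a segment starts on a logical sector boundary, both on
 * disk and within the frame, and that it covers a whole number of
 * logical sectors.
 */
static bool segment_is_aligned(uint64_t sector_number, uint8_t first_sect,
                               uint8_t last_sect, uint32_t sector_size)
{
    uint32_t n = units_per_sector(sector_size);

    if (last_sect < first_sect)
        return false;

    return sector_number % n == 0 &&
           first_sect % n == 0 &&
           (last_sect - first_sect + 1u) % n == 0;
}
```

With 'sector-size' = 4096, logical block 3 maps to sector_number 24, and
a segment covering first_sect 0 to last_sect 7 passes the check, while
one ending at last_sect 6 does not.  With 'sector-size' = 512 every
combination with first_sect <= last_sect passes.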

> 
> > - *      NOTE: Because of implementation bugs in some frontends this must be
> > - *            set to 512, unless the frontend advertizes a non-zero value
> > - *            in its "feature-large-sector-size" xenbus node. (See below).
> > - *
> >   * physical-sector-size
> >   *      Values:         <uint32_t>
> >   *      Default Value:  <"sector-size">
> > @@ -254,8 +250,8 @@
> >   * sectors
> >   *      Values:         <uint64_t>
> >   *
> > - *      The size of the backend device, expressed in units of "sector-size".
> > - *      The product of "sector-size" and "sectors" must also be an integer
> > + *      The size of the backend device, expressed in units of 512b.
> > + *      The product of "sector-size" * 512 must also be an integer
> >   *      multiple of "physical-sector-size", if that node is present.
> >   *
> >   *****************************************************************************
> > @@ -338,6 +334,7 @@
> >   * feature-large-sector-size
> >   *      Values:         0/1 (boolean)
> >   *      Default Value:  0
> > + *      Notes:          DEPRECATED, 12
> >   *
> >   *      A value of "1" indicates that the frontend will correctly supply and
> 
> Could you remove "correctly" from this sentence? It's misleading.

The whole feature is deprecated, so I would rather leave it as-is for
historical reference.  The added note attempts to reflect that it
should not be exposed by frontends, neither should backends make any
decisions based on its presence.

> >   *      interpret all sector-based quantities in terms of the "sector-size"
> > @@ -411,6 +408,11 @@
> >   *(10) The discard-secure property may be present and will be set to 1 if the
> >   *     backing device supports secure discard.
> >   *(11) Only used by Linux and NetBSD.
> > + *(12) Possibly only ever implemented by the QEMU Qdisk backend and the Windows
> > + *     PV block frontend.  Other backends and frontends supported 'sector-size'
> > + *     values greater than 512 before such feature was added.  Frontends should
> > + *     not expose this node, neither should backends make any decisions based
> > + *     on it being exposed by the frontend.
> >   */
> >
> >  /*
> > @@ -621,9 +623,12 @@
> >  /*
> >   * NB. 'first_sect' and 'last_sect' in blkif_request_segment, as well as
> >   * 'sector_number' in blkif_request, blkif_request_discard and
> > - * blkif_request_indirect are sector-based quantities. See the description
> > - * of the "feature-large-sector-size" frontend xenbus node above for
> > - * more information.
> > + * blkif_request_indirect are all in units of 512 bytes, regardless of whether the
> > + * 'sector-size' xenstore node contains a value greater than 512.
> > + *
> > + * However the value in those fields must be properly aligned to the logical
> > + * sector size reported by the 'sector-size' xenstore node, see 'Backend Device
> > + * Properties' section.
> >   */
> >  struct blkif_request_segment {
> 
> Textually (that is without reading it) this comment seems to only apply
> to `struct blkif_request_segment`. There is another comment that
> separates it from `struct blkif_request` (and it is far away from
> blkif_request_discard and blkif_request_indirect). For `struct
> blkif_request.sector_number`, the only comment is "start sector idx on
> disk", but it's really hard to tell how to interpret it: it could be
> interpreted as a "sector-size" quantity because that is the size of a
> sector on the disk, the underlying storage.
> 
> So, I think we need to change the comment of
> `blkif_request.sector_number`.

OK, I will trim the comment in blkif_request_segment a bit and then
sprinkle comments about the sector base units in the different
structures defined below.

> 
> Another thing, there's a "type" `blkif_sector_t` defined at the beginning
> of the file, would it be worth it to add a description to it?

IMO it's better to add the description as close as possible to the
field declaration, rather than the declaration of the field type.

Thanks, Roger.

Patch

diff --git a/xen/include/public/io/blkif.h b/xen/include/public/io/blkif.h
index 22f1eef0c0ca..07708f4d08eb 100644
--- a/xen/include/public/io/blkif.h
+++ b/xen/include/public/io/blkif.h
@@ -240,10 +240,6 @@ 
  *      The logical block size, in bytes, of the underlying storage. This
  *      must be a power of two with a minimum value of 512.
  *
- *      NOTE: Because of implementation bugs in some frontends this must be
- *            set to 512, unless the frontend advertizes a non-zero value
- *            in its "feature-large-sector-size" xenbus node. (See below).
- *
  * physical-sector-size
  *      Values:         <uint32_t>
  *      Default Value:  <"sector-size">
@@ -254,8 +250,8 @@ 
  * sectors
  *      Values:         <uint64_t>
  *
- *      The size of the backend device, expressed in units of "sector-size".
- *      The product of "sector-size" and "sectors" must also be an integer
+ *      The size of the backend device, expressed in units of 512b.
+ *      The product of "sector-size" * 512 must also be an integer
  *      multiple of "physical-sector-size", if that node is present.
  *
  *****************************************************************************
@@ -338,6 +334,7 @@ 
  * feature-large-sector-size
  *      Values:         0/1 (boolean)
  *      Default Value:  0
+ *      Notes:          DEPRECATED, 12
  *
  *      A value of "1" indicates that the frontend will correctly supply and
  *      interpret all sector-based quantities in terms of the "sector-size"
@@ -411,6 +408,11 @@ 
  *(10) The discard-secure property may be present and will be set to 1 if the
  *     backing device supports secure discard.
  *(11) Only used by Linux and NetBSD.
+ *(12) Possibly only ever implemented by the QEMU Qdisk backend and the Windows
+ *     PV block frontend.  Other backends and frontends supported 'sector-size'
+ *     values greater than 512 before such feature was added.  Frontends should
+ *     not expose this node, neither should backends make any decisions based
+ *     on it being exposed by the frontend.
  */
 
 /*
@@ -621,9 +623,12 @@ 
 /*
  * NB. 'first_sect' and 'last_sect' in blkif_request_segment, as well as
  * 'sector_number' in blkif_request, blkif_request_discard and
- * blkif_request_indirect are sector-based quantities. See the description
- * of the "feature-large-sector-size" frontend xenbus node above for
- * more information.
+ * blkif_request_indirect are all in units of 512 bytes, regardless of whether the
+ * 'sector-size' xenstore node contains a value greater than 512.
+ *
+ * However the value in those fields must be properly aligned to the logical
+ * sector size reported by the 'sector-size' xenstore node, see 'Backend Device
+ * Properties' section.
  */
 struct blkif_request_segment {
     grant_ref_t gref;        /* reference to I/O buffer frame        */