[v2,13/16] xen-blkback: Implement diskseq checks

Message ID	20230530203116.2008-14-demi@invisiblethingslab.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Feedback-ID: iac594737:Fastmail From: Demi Marie Obenour <demi@invisiblethingslab.com> To: Jens Axboe <axboe@kernel.dk>, =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Alasdair Kergon <agk@redhat.com>, Mike Snitzer <snitzer@kernel.org>, dm-devel@redhat.com Cc: Demi Marie Obenour <demi@invisiblethingslab.com>, =?utf-8?q?Marek_Marczy?= =?utf-8?q?kowski-G=C3=B3recki?= <marmarek@invisiblethingslab.com>, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, xen-devel@lists.xenproject.org Subject: [PATCH v2 13/16] xen-blkback: Implement diskseq checks Date: Tue, 30 May 2023 16:31:13 -0400 Message-Id: <20230530203116.2008-14-demi@invisiblethingslab.com> In-Reply-To: <20230530203116.2008-1-demi@invisiblethingslab.com> References: <20230530203116.2008-1-demi@invisiblethingslab.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Diskseq support in loop, device-mapper, and blkback \| expand [v2,00/16] Diskseq support in loop, device-mapper, and blkback [v2,01/16] device-mapper: Check that target specs are sufficiently aligned [v2,02/16] device-mapper: Avoid pointer arithmetic overflow [v2,03/16] device-mapper: do not allow targets to overlap 'struct dm_ioctl' [v2,04/16] device-mapper: Better error message for too-short target spec [v2,05/16] device-mapper: Target parameters must not overlap next target spec [v2,06/16] device-mapper: Avoid double-fetch of version [v2,07/16] device-mapper: Allow userspace to opt-in to strict parameter checks [v2,08/16] device-mapper: Allow userspace to provide expected diskseq [v2,09/16] device-mapper: Allow userspace to suppress uevent generation [v2,10/16] device-mapper: Refuse to create device named "control" [v2,11/16] device-mapper: "." and ".." are not valid symlink names [v2,12/16] device-mapper: inform caller about already-existing device [v2,13/16] xen-blkback: Implement diskseq checks [v2,14/16] block, loop: Increment diskseq when releasing a loop device [v2,15/16] xen-blkback: Minor cleanups [v2,16/16] xen-blkback: Inform userspace that device has been opened

Demi Marie Obenour May 30, 2023, 8:31 p.m. UTC

This allows specifying a disk sequence number in XenStore.  If it does
not match the disk sequence number of the underlying device, the device
will not be exported and a warning will be logged.  Userspace can use
this to eliminate race conditions due to major/minor number reuse.
Old kernels do not support the new syntax, but a later patch will allow
userspace to discover that the new syntax is supported.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
 1 file changed, 89 insertions(+), 23 deletions(-)

Roger Pau Monné June 6, 2023, 8:25 a.m. UTC | #1

On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> This allows specifying a disk sequence number in XenStore.  If it does
> not match the disk sequence number of the underlying device, the device
> will not be exported and a warning will be logged.  Userspace can use
> this to eliminate race conditions due to major/minor number reuse.
> Old kernels do not support the new syntax, but a later patch will allow
> userspace to discover that the new syntax is supported.
> 
> Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> ---
>  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
>  1 file changed, 89 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> --- a/drivers/block/xen-blkback/xenbus.c
> +++ b/drivers/block/xen-blkback/xenbus.c
> @@ -24,6 +24,7 @@ struct backend_info {
>  	struct xenbus_watch	backend_watch;
>  	unsigned		major;
>  	unsigned		minor;
> +	unsigned long long	diskseq;

Since diskseq is declared as u64 in gendisk, better use the same type
here too?

>  	char			*mode;
>  };
>  
> @@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)
>  
>  static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  			  unsigned major, unsigned minor, int readonly,
> -			  int cdrom)
> +			  bool cdrom, u64 diskseq)
>  {
>  	struct xen_vbd *vbd;
>  	struct block_device *bdev;
> @@ -507,6 +508,26 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
>  		xen_vbd_free(vbd);
>  		return -ENOENT;
>  	}
> +
> +	if (diskseq) {
> +		struct gendisk *disk = bdev->bd_disk;

const.

> +
> +		if (unlikely(disk == NULL)) {
> +			pr_err("%s: device %08x has no gendisk\n",
> +			       __func__, vbd->pdevice);
> +			xen_vbd_free(vbd);
> +			return -EFAULT;

ENODEV or ENOENT might be more accurate IMO.

> +		}
> +
> +		if (unlikely(disk->diskseq != diskseq)) {
> +			pr_warn("%s: device %08x has incorrect sequence "
> +				"number 0x%llx (expected 0x%llx)\n",

I prefer %#llx, and likely pr_err like above.  Also I think it's now
preferred to not split printed lines, so that `grep "has incorrect
sequence number" ...` can find the instance.

> +				__func__, vbd->pdevice, disk->diskseq, diskseq);
> +			xen_vbd_free(vbd);
> +			return -ENODEV;
> +		}
> +	}
> +
>  	vbd->size = vbd_sz(vbd);
>  
>  	if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
> @@ -707,6 +728,9 @@ static void backend_changed(struct xenbus_watch *watch,
>  	int cdrom = 0;
>  	unsigned long handle;
>  	char *device_type;
> +	char *diskseq_str = NULL;

const, and I think there's no need to init to NULL.

> +	int diskseq_len;

unsigned int

> +	unsigned long long diskseq;

u64

>  
>  	pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
>  
> @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
>  		return;
>  	}
>  
> -	if (be->major | be->minor) {
> -		if (be->major != major || be->minor != minor)
> -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> -				be->major, be->minor, major, minor);
> +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> +	if (IS_ERR(diskseq_str)) {
> +		int err = PTR_ERR(diskseq_str);
> +		diskseq_str = NULL;
> +
> +		/*
> +		 * If this does not exist, it means legacy userspace that does not
> +		 * support diskseq.
> +		 */
> +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> +			xenbus_dev_fatal(dev, err, "reading diskseq");
> +			return;
> +		}
> +		diskseq = 0;
> +	} else if (diskseq_len <= 0) {
> +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> +		goto fail;
> +	} else if (diskseq_len > 16) {
> +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> +				 diskseq_len);
> +		goto fail;
> +	} else if (diskseq_str[0] == '0') {
> +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> +		goto fail;
> +	} else {
> +		char *diskseq_end;
> +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> +		if (diskseq_end != diskseq_str + diskseq_len) {
> +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> +			goto fail;
> +		}
> +		kfree(diskseq_str);
> +		diskseq_str = NULL;
> +	}

Won't it be simpler to use xenbus_scanf() with %llx formatter?

Also, we might want to fetch "physical-device" and "diskseq" inside
the same xenstore transaction.

Also, you tie this logic to the "physical-device" watch, which
strictly implies that the "diskseq" node must be written to xenstore
before the "physical-device" node.  This seems fragile, but I don't
see much better optiono since the "diskseq" is optional.

The node and its behaviour should be documented in blkif.h.

> +	if (be->major | be->minor | be->diskseq) {
> +		if (be->major != major || be->minor != minor || be->diskseq != diskseq)
> +			pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
> +				" not supported.\n",
> +				be->major, be->minor, be->diskseq, major, minor, diskseq);
>  		return;

You are leaking diskseq_str here, and in all the error cases between
here and up to the call to xen_vbd_create().

It might be better to simnply free diskseq_str once you are done with
the processing, and have set diskseq.

Otherwise see my suggestion of using xenbus_scanf().

Thanks, Roger.

Demi Marie Obenour June 6, 2023, 5:01 p.m. UTC | #2

On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > This allows specifying a disk sequence number in XenStore.  If it does
> > not match the disk sequence number of the underlying device, the device
> > will not be exported and a warning will be logged.  Userspace can use
> > this to eliminate race conditions due to major/minor number reuse.
> > Old kernels do not support the new syntax, but a later patch will allow
> > userspace to discover that the new syntax is supported.
> > 
> > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > ---
> >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> >  1 file changed, 89 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -24,6 +24,7 @@ struct backend_info {
> >  	struct xenbus_watch	backend_watch;
> >  	unsigned		major;
> >  	unsigned		minor;
> > +	unsigned long long	diskseq;
> 
> Since diskseq is declared as u64 in gendisk, better use the same type
> here too?

simple_strtoull() returns an unsigned long long, and C permits unsigned
long long to be larger than 64 bits.

> >  	char			*mode;
> >  };
> >  
> > @@ -479,7 +480,7 @@ static void xen_vbd_free(struct xen_vbd *vbd)
> >  
> >  static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
> >  			  unsigned major, unsigned minor, int readonly,
> > -			  int cdrom)
> > +			  bool cdrom, u64 diskseq)
> >  {
> >  	struct xen_vbd *vbd;
> >  	struct block_device *bdev;
> > @@ -507,6 +508,26 @@ static int xen_vbd_create(struct xen_blkif *blkif, blkif_vdev_t handle,
> >  		xen_vbd_free(vbd);
> >  		return -ENOENT;
> >  	}
> > +
> > +	if (diskseq) {
> > +		struct gendisk *disk = bdev->bd_disk;
> 
> const.
> 
> > +
> > +		if (unlikely(disk == NULL)) {
> > +			pr_err("%s: device %08x has no gendisk\n",
> > +			       __func__, vbd->pdevice);
> > +			xen_vbd_free(vbd);
> > +			return -EFAULT;
> 
> ENODEV or ENOENT might be more accurate IMO.

I will drop it, as this turns out to be unreachable code.

> > +		}
> > +
> > +		if (unlikely(disk->diskseq != diskseq)) {
> > +			pr_warn("%s: device %08x has incorrect sequence "
> > +				"number 0x%llx (expected 0x%llx)\n",
> 
> I prefer %#llx, and likely pr_err like above.  Also I think it's now
> preferred to not split printed lines, so that `grep "has incorrect
> sequence number" ...` can find the instance.

Ah, so _that_ is why I got a warning from checkpatch!

> > +				__func__, vbd->pdevice, disk->diskseq, diskseq);
> > +			xen_vbd_free(vbd);
> > +			return -ENODEV;
> > +		}
> > +	}
> > +
> >  	vbd->size = vbd_sz(vbd);
> >  
> >  	if (cdrom || disk_to_cdi(vbd->bdev->bd_disk))
> > @@ -707,6 +728,9 @@ static void backend_changed(struct xenbus_watch *watch,
> >  	int cdrom = 0;
> >  	unsigned long handle;
> >  	char *device_type;
> > +	char *diskseq_str = NULL;
> 
> const, and I think there's no need to init to NULL.
> 
> > +	int diskseq_len;
> 
> unsigned int
> 
> > +	unsigned long long diskseq;
> 
> u64
> 
> >  
> >  	pr_debug("%s %p %d\n", __func__, dev, dev->otherend_id);
> >  
> > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> >  		return;
> >  	}
> >  
> > -	if (be->major | be->minor) {
> > -		if (be->major != major || be->minor != minor)
> > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > -				be->major, be->minor, major, minor);
> > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > +	if (IS_ERR(diskseq_str)) {
> > +		int err = PTR_ERR(diskseq_str);
> > +		diskseq_str = NULL;
> > +
> > +		/*
> > +		 * If this does not exist, it means legacy userspace that does not
> > +		 * support diskseq.
> > +		 */
> > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > +			return;
> > +		}
> > +		diskseq = 0;
> > +	} else if (diskseq_len <= 0) {
> > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > +		goto fail;
> > +	} else if (diskseq_len > 16) {
> > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > +				 diskseq_len);
> > +		goto fail;
> > +	} else if (diskseq_str[0] == '0') {
> > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > +		goto fail;
> > +	} else {
> > +		char *diskseq_end;
> > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > +			goto fail;
> > +		}
> > +		kfree(diskseq_str);
> > +		diskseq_str = NULL;
> > +	}
> 
> Won't it be simpler to use xenbus_scanf() with %llx formatter?

xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
really should not.  Should this be fixed in xenbus_scanf()?

> Also, we might want to fetch "physical-device" and "diskseq" inside
> the same xenstore transaction.

Should the rest of the xenstore reads be included in the same
transaction?

> Also, you tie this logic to the "physical-device" watch, which
> strictly implies that the "diskseq" node must be written to xenstore
> before the "physical-device" node.  This seems fragile, but I don't
> see much better optiono since the "diskseq" is optional.

What about including the diskseq in the "physical-device" node?  Perhaps
use diskseq@major:minor syntax?

> The node and its behaviour should be documented in blkif.h.

Indeed so.

> > +	if (be->major | be->minor | be->diskseq) {
> > +		if (be->major != major || be->minor != minor || be->diskseq != diskseq)
> > +			pr_warn("changing physical device (from %x:%x:%llx to %x:%x:%llx)"
> > +				" not supported.\n",
> > +				be->major, be->minor, be->diskseq, major, minor, diskseq);
> >  		return;
> 
> You are leaking diskseq_str here, and in all the error cases between
> here and up to the call to xen_vbd_create().

I will fix this by moving the diskseq reading code into its own
function.

Roger Pau Monné June 7, 2023, 8:20 a.m. UTC | #3

On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > This allows specifying a disk sequence number in XenStore.  If it does
> > > not match the disk sequence number of the underlying device, the device
> > > will not be exported and a warning will be logged.  Userspace can use
> > > this to eliminate race conditions due to major/minor number reuse.
> > > Old kernels do not support the new syntax, but a later patch will allow
> > > userspace to discover that the new syntax is supported.
> > > 
> > > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > > ---
> > >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> > >  1 file changed, 89 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > > --- a/drivers/block/xen-blkback/xenbus.c
> > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > @@ -24,6 +24,7 @@ struct backend_info {
> > >  	struct xenbus_watch	backend_watch;
> > >  	unsigned		major;
> > >  	unsigned		minor;
> > > +	unsigned long long	diskseq;
> > 
> > Since diskseq is declared as u64 in gendisk, better use the same type
> > here too?
> 
> simple_strtoull() returns an unsigned long long, and C permits unsigned
> long long to be larger than 64 bits.

Right, but the type of gendisk is u64.  It's fine if you want to store
the result of simple_strtoull() into an unsigned long long and do
whatever checks to assert it matches the format expected by gendisk,
but ultimately the field type would better use u64 for consistency IMO.

> > > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> > >  		return;
> > >  	}
> > >  
> > > -	if (be->major | be->minor) {
> > > -		if (be->major != major || be->minor != minor)
> > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > -				be->major, be->minor, major, minor);
> > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > +	if (IS_ERR(diskseq_str)) {
> > > +		int err = PTR_ERR(diskseq_str);
> > > +		diskseq_str = NULL;
> > > +
> > > +		/*
> > > +		 * If this does not exist, it means legacy userspace that does not
> > > +		 * support diskseq.
> > > +		 */
> > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > +			return;
> > > +		}
> > > +		diskseq = 0;
> > > +	} else if (diskseq_len <= 0) {
> > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > +		goto fail;
> > > +	} else if (diskseq_len > 16) {
> > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > +				 diskseq_len);
> > > +		goto fail;
> > > +	} else if (diskseq_str[0] == '0') {
> > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > +		goto fail;
> > > +	} else {
> > > +		char *diskseq_end;
> > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > +			goto fail;
> > > +		}
> > > +		kfree(diskseq_str);
> > > +		diskseq_str = NULL;
> > > +	}
> > 
> > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> 
> xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> really should not.  Should this be fixed in xenbus_scanf()?

That would be my preference, so that you can use it here instead of
kind of open-coding it.

> > Also, we might want to fetch "physical-device" and "diskseq" inside
> > the same xenstore transaction.
> 
> Should the rest of the xenstore reads be included in the same
> transaction?

I guess it would make the code simpler to indeed fetch everything
inside the same transaction.

> > Also, you tie this logic to the "physical-device" watch, which
> > strictly implies that the "diskseq" node must be written to xenstore
> > before the "physical-device" node.  This seems fragile, but I don't
> > see much better optiono since the "diskseq" is optional.
> 
> What about including the diskseq in the "physical-device" node?  Perhaps
> use diskseq@major:minor syntax?

Hm, how would you know whether the blkback instance in the kernel
supports the diskseq syntax in physical-device?

Can you fetch a disk using a diskseq identifier?

Why I understand that this is an extra safety check in order to assert
blkback is opening the intended device, is this attempting to fix some
existing issue?

I'm not sure I see how the major:minor numbers would point to a
different device than the one specified by the toolstack unless the
admin explicitly messes with the devices before blkback has got time
to open them.  But then the admin can already do pretty much
everything it wants with the system.

Thanks, Roger.

Demi Marie Obenour June 7, 2023, 4:14 p.m. UTC | #4

On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > This allows specifying a disk sequence number in XenStore.  If it does
> > > > not match the disk sequence number of the underlying device, the device
> > > > will not be exported and a warning will be logged.  Userspace can use
> > > > this to eliminate race conditions due to major/minor number reuse.
> > > > Old kernels do not support the new syntax, but a later patch will allow
> > > > userspace to discover that the new syntax is supported.
> > > > 
> > > > Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
> > > > ---
> > > >  drivers/block/xen-blkback/xenbus.c | 112 +++++++++++++++++++++++------
> > > >  1 file changed, 89 insertions(+), 23 deletions(-)
> > > > 
> > > > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > > > index 4807af1d58059394d7a992335dabaf2bc3901721..9c3eb148fbd802c74e626c3d7bcd69dcb09bd921 100644
> > > > --- a/drivers/block/xen-blkback/xenbus.c
> > > > +++ b/drivers/block/xen-blkback/xenbus.c
> > > > @@ -24,6 +24,7 @@ struct backend_info {
> > > >  	struct xenbus_watch	backend_watch;
> > > >  	unsigned		major;
> > > >  	unsigned		minor;
> > > > +	unsigned long long	diskseq;
> > > 
> > > Since diskseq is declared as u64 in gendisk, better use the same type
> > > here too?
> > 
> > simple_strtoull() returns an unsigned long long, and C permits unsigned
> > long long to be larger than 64 bits.
> 
> Right, but the type of gendisk is u64.  It's fine if you want to store
> the result of simple_strtoull() into an unsigned long long and do
> whatever checks to assert it matches the format expected by gendisk,
> but ultimately the field type would better use u64 for consistency IMO.

I changed my mind on this, not least because the 16-byte length limit
means that the value is limited to UINT64_MAX anyway.

> > > > @@ -725,10 +749,46 @@ static void backend_changed(struct xenbus_watch *watch,
> > > >  		return;
> > > >  	}
> > > >  
> > > > -	if (be->major | be->minor) {
> > > > -		if (be->major != major || be->minor != minor)
> > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > -				be->major, be->minor, major, minor);
> > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > +	if (IS_ERR(diskseq_str)) {
> > > > +		int err = PTR_ERR(diskseq_str);
> > > > +		diskseq_str = NULL;
> > > > +
> > > > +		/*
> > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > +		 * support diskseq.
> > > > +		 */
> > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > +			return;
> > > > +		}
> > > > +		diskseq = 0;
> > > > +	} else if (diskseq_len <= 0) {
> > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > +		goto fail;
> > > > +	} else if (diskseq_len > 16) {
> > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > +				 diskseq_len);
> > > > +		goto fail;
> > > > +	} else if (diskseq_str[0] == '0') {
> > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > +		goto fail;
> > > > +	} else {
> > > > +		char *diskseq_end;
> > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > +			goto fail;
> > > > +		}
> > > > +		kfree(diskseq_str);
> > > > +		diskseq_str = NULL;
> > > > +	}
> > > 
> > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > 
> > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > really should not.  Should this be fixed in xenbus_scanf()?
> 
> That would be my preference, so that you can use it here instead of
> kind of open-coding it.

This winds up being a much more invasive patch as it requires changing
sscanf().  It also has a risk (probably mostly theoretical) of breaking
buggy userspace that passes garbage values here.

> > > Also, we might want to fetch "physical-device" and "diskseq" inside
> > > the same xenstore transaction.
> > 
> > Should the rest of the xenstore reads be included in the same
> > transaction?
> 
> I guess it would make the code simpler to indeed fetch everything
> inside the same transaction.

Okay, will change in v3.

> > > Also, you tie this logic to the "physical-device" watch, which
> > > strictly implies that the "diskseq" node must be written to xenstore
> > > before the "physical-device" node.  This seems fragile, but I don't
> > > see much better optiono since the "diskseq" is optional.
> > 
> > What about including the diskseq in the "physical-device" node?  Perhaps
> > use diskseq@major:minor syntax?
> 
> Hm, how would you know whether the blkback instance in the kernel
> supports the diskseq syntax in physical-device?

That’s what the next patch is for

Roger Pau Monné June 8, 2023, 8:29 a.m. UTC | #5

On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > -	if (be->major | be->minor) {
> > > > > -		if (be->major != major || be->minor != minor)
> > > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > > -				be->major, be->minor, major, minor);
> > > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > > +	if (IS_ERR(diskseq_str)) {
> > > > > +		int err = PTR_ERR(diskseq_str);
> > > > > +		diskseq_str = NULL;
> > > > > +
> > > > > +		/*
> > > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > > +		 * support diskseq.
> > > > > +		 */
> > > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > > +			return;
> > > > > +		}
> > > > > +		diskseq = 0;
> > > > > +	} else if (diskseq_len <= 0) {
> > > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > > +		goto fail;
> > > > > +	} else if (diskseq_len > 16) {
> > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > > +				 diskseq_len);
> > > > > +		goto fail;
> > > > > +	} else if (diskseq_str[0] == '0') {
> > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > > +		goto fail;
> > > > > +	} else {
> > > > > +		char *diskseq_end;
> > > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > > +			goto fail;
> > > > > +		}
> > > > > +		kfree(diskseq_str);
> > > > > +		diskseq_str = NULL;
> > > > > +	}
> > > > 
> > > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > > 
> > > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > > really should not.  Should this be fixed in xenbus_scanf()?
> > 
> > That would be my preference, so that you can use it here instead of
> > kind of open-coding it.
> 
> This winds up being a much more invasive patch as it requires changing
> sscanf().  It also has a risk (probably mostly theoretical) of breaking
> buggy userspace that passes garbage values here.

Well, if the current function is not suitable for your purposes it
would be better to fix it rather than open-code what you need.  Mostly
because further usages would then also need to open-code whatever
required.

> > > > Also, you tie this logic to the "physical-device" watch, which
> > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > see much better optiono since the "diskseq" is optional.
> > > 
> > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > use diskseq@major:minor syntax?
> > 
> > Hm, how would you know whether the blkback instance in the kernel
> > supports the diskseq syntax in physical-device?
> 
> That’s what the next patch is for

Demi Marie Obenour June 8, 2023, 3:33 p.m. UTC | #6

On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > -	if (be->major | be->minor) {
> > > > > > -		if (be->major != major || be->minor != minor)
> > > > > > -			pr_warn("changing physical device (from %x:%x to %x:%x) not supported.\n",
> > > > > > -				be->major, be->minor, major, minor);
> > > > > > +	diskseq_str = xenbus_read(XBT_NIL, dev->nodename, "diskseq", &diskseq_len);
> > > > > > +	if (IS_ERR(diskseq_str)) {
> > > > > > +		int err = PTR_ERR(diskseq_str);
> > > > > > +		diskseq_str = NULL;
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * If this does not exist, it means legacy userspace that does not
> > > > > > +		 * support diskseq.
> > > > > > +		 */
> > > > > > +		if (unlikely(!XENBUS_EXIST_ERR(err))) {
> > > > > > +			xenbus_dev_fatal(dev, err, "reading diskseq");
> > > > > > +			return;
> > > > > > +		}
> > > > > > +		diskseq = 0;
> > > > > > +	} else if (diskseq_len <= 0) {
> > > > > > +		xenbus_dev_fatal(dev, -EFAULT, "diskseq must not be empty");
> > > > > > +		goto fail;
> > > > > > +	} else if (diskseq_len > 16) {
> > > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq too long: got %d but limit is 16",
> > > > > > +				 diskseq_len);
> > > > > > +		goto fail;
> > > > > > +	} else if (diskseq_str[0] == '0') {
> > > > > > +		xenbus_dev_fatal(dev, -ERANGE, "diskseq must not start with '0'");
> > > > > > +		goto fail;
> > > > > > +	} else {
> > > > > > +		char *diskseq_end;
> > > > > > +		diskseq = simple_strtoull(diskseq_str, &diskseq_end, 16);
> > > > > > +		if (diskseq_end != diskseq_str + diskseq_len) {
> > > > > > +			xenbus_dev_fatal(dev, -EINVAL, "invalid diskseq");
> > > > > > +			goto fail;
> > > > > > +		}
> > > > > > +		kfree(diskseq_str);
> > > > > > +		diskseq_str = NULL;
> > > > > > +	}
> > > > > 
> > > > > Won't it be simpler to use xenbus_scanf() with %llx formatter?
> > > > 
> > > > xenbus_scanf() doesn’t check for overflow and accepts lots of junk it
> > > > really should not.  Should this be fixed in xenbus_scanf()?
> > > 
> > > That would be my preference, so that you can use it here instead of
> > > kind of open-coding it.
> > 
> > This winds up being a much more invasive patch as it requires changing
> > sscanf().  It also has a risk (probably mostly theoretical) of breaking
> > buggy userspace that passes garbage values here.
> 
> Well, if the current function is not suitable for your purposes it
> would be better to fix it rather than open-code what you need.  Mostly
> because further usages would then also need to open-code whatever
> required.

That is fair.

> > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > see much better optiono since the "diskseq" is optional.
> > > > 
> > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > use diskseq@major:minor syntax?
> > > 
> > > Hm, how would you know whether the blkback instance in the kernel
> > > supports the diskseq syntax in physical-device?
> > 
> > That’s what the next patch is for

Roger Pau Monné June 9, 2023, 3:13 p.m. UTC | #7

On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > > see much better optiono since the "diskseq" is optional.
> > > > > 
> > > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > > use diskseq@major:minor syntax?
> > > > 
> > > > Hm, how would you know whether the blkback instance in the kernel
> > > > supports the diskseq syntax in physical-device?
> > > 
> > > That’s what the next patch is for

Demi Marie Obenour June 9, 2023, 4:55 p.m. UTC | #8

On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > On Tue, Jun 06, 2023 at 01:01:20PM -0400, Demi Marie Obenour wrote:
> > > > > > On Tue, Jun 06, 2023 at 10:25:47AM +0200, Roger Pau Monné wrote:
> > > > > > > On Tue, May 30, 2023 at 04:31:13PM -0400, Demi Marie Obenour wrote:
> > > > > > > Also, you tie this logic to the "physical-device" watch, which
> > > > > > > strictly implies that the "diskseq" node must be written to xenstore
> > > > > > > before the "physical-device" node.  This seems fragile, but I don't
> > > > > > > see much better optiono since the "diskseq" is optional.
> > > > > > 
> > > > > > What about including the diskseq in the "physical-device" node?  Perhaps
> > > > > > use diskseq@major:minor syntax?
> > > > > 
> > > > > Hm, how would you know whether the blkback instance in the kernel
> > > > > supports the diskseq syntax in physical-device?
> > > > 
> > > > That’s what the next patch is for

Roger Pau Monné June 12, 2023, 8:09 a.m. UTC | #9

On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > Can you fetch a disk using a diskseq identifier?
> > > > > 
> > > > > Not yet, although I have considered adding this ability.  It would be
> > > > > one step towards a “diskseqfs” that userspace could use to open a device
> > > > > by diskseq.
> > > > > 
> > > > > > Why I understand that this is an extra safety check in order to assert
> > > > > > blkback is opening the intended device, is this attempting to fix some
> > > > > > existing issue?
> > > > > 
> > > > > Yes, it is.  I have a block script (written in C) that validates the
> > > > > device it has opened before passing the information to blkback.  It uses
> > > > > the diskseq to do this, but for that protection to be complete, blkback
> > > > > must also be aware of it.
> > > > 
> > > > But if your block script opens the device, and keeps it open until
> > > > blkback has also taken a reference to it, there's no way such device
> > > > could be removed and recreated in the window you point out above, as
> > > > there's always a reference on it taken?
> > > 
> > > This assumes that the block script is not killed in the meantime,
> > > which is not a safe assumption due to timeouts and the OOM killer.
> > 
> > Doesn't seem very reliable to use with delete-on-close either then.
> 
> That’s actually the purpose of delete-on-close!  It ensures that if the
> block script gets killed, the device is automatically cleaned up.

Block script attach getting killed shouldn't prevent the toolstack
from performing domain destruction, and thus removing the stale block
device.

OTOH if your toolstack gets killed then there's not much that can be
done, and the system will need intervention in order to get back into
a sane state.

Hitting OOM in your control domain however is unlikely to be handled
gracefully, even with delete-on-close.

> > > > Then the block script will open the device by diskseq and pass the
> > > > major:minor numbers to blkback.
> > > 
> > > Alternatively, the toolstack could write both the diskseq and
> > > major:minor numbers and be confident that it is referring to the
> > > correct device, no matter how long ago it got that information.
> > > This could be quite useful for e.g. one VM exporting a device to
> > > another VM by calling losetup(8) and expecting a human to make a
> > > decision based on various properties about the device.  In this
> > > case there is no upper bound on the race window.
> > 
> > Instead of playing with xenstore nodes, it might be better to simply
> > have blkback export on sysfs the diskseq of the opened device, and let
> > the block script check whether that's correct or not.  That implies
> > less code in the kernel side, and doesn't pollute xenstore.
> 
> This would require that blkback delay exposing the device to the
> frontend until the block script has checked that the diskseq is correct.

This depends on your toolstack implementation.  libxl won't start the
domain until block scripts have finished execution, and hence the
block script waiting for the sysfs node to appear and check it against
the expected value would be enough.

> Much simpler for the block script to provide the diskseq in xenstore.
> If you want to avoid an extra xenstore node, I can make the diskseq part
> of the physical-device node.

I'm thinking that we might want to introduce a "physical-device-uuid"
node and use that to provide the diskseq to the backened.  Toolstacks
(or block scripts) would need to be sure the "physical-device-uuid"
node is populated before setting "physical-device", as writes to
that node would still trigger blkback watch.  I think using two
distinct watches would just make the logic in blkback too
complicated.

My preference would be for the kernel to have a function for opening a
device identified by a diskseq (as fetched from
"physical-device-uuid"), so that we don't have to open using
major:minor and then check the diskseq.

Thanks, Roger.

Demi Marie Obenour June 21, 2023, 1:14 a.m. UTC | #10

On Mon, Jun 12, 2023 at 10:09:39AM +0200, Roger Pau Monné wrote:
> On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> > On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > > Can you fetch a disk using a diskseq identifier?
> > > > > > 
> > > > > > Not yet, although I have considered adding this ability.  It would be
> > > > > > one step towards a “diskseqfs” that userspace could use to open a device
> > > > > > by diskseq.
> > > > > > 
> > > > > > > Why I understand that this is an extra safety check in order to assert
> > > > > > > blkback is opening the intended device, is this attempting to fix some
> > > > > > > existing issue?
> > > > > > 
> > > > > > Yes, it is.  I have a block script (written in C) that validates the
> > > > > > device it has opened before passing the information to blkback.  It uses
> > > > > > the diskseq to do this, but for that protection to be complete, blkback
> > > > > > must also be aware of it.
> > > > > 
> > > > > But if your block script opens the device, and keeps it open until
> > > > > blkback has also taken a reference to it, there's no way such device
> > > > > could be removed and recreated in the window you point out above, as
> > > > > there's always a reference on it taken?
> > > > 
> > > > This assumes that the block script is not killed in the meantime,
> > > > which is not a safe assumption due to timeouts and the OOM killer.
> > > 
> > > Doesn't seem very reliable to use with delete-on-close either then.
> > 
> > That’s actually the purpose of delete-on-close!  It ensures that if the
> > block script gets killed, the device is automatically cleaned up.
> 
> Block script attach getting killed shouldn't prevent the toolstack
> from performing domain destruction, and thus removing the stale block
> device.
> 
> OTOH if your toolstack gets killed then there's not much that can be
> done, and the system will need intervention in order to get back into
> a sane state.
> 
> Hitting OOM in your control domain however is unlikely to be handled
> gracefully, even with delete-on-close.

I agree, _but_ we should not make it any harder than necessary.

> > > > > Then the block script will open the device by diskseq and pass the
> > > > > major:minor numbers to blkback.
> > > > 
> > > > Alternatively, the toolstack could write both the diskseq and
> > > > major:minor numbers and be confident that it is referring to the
> > > > correct device, no matter how long ago it got that information.
> > > > This could be quite useful for e.g. one VM exporting a device to
> > > > another VM by calling losetup(8) and expecting a human to make a
> > > > decision based on various properties about the device.  In this
> > > > case there is no upper bound on the race window.
> > > 
> > > Instead of playing with xenstore nodes, it might be better to simply
> > > have blkback export on sysfs the diskseq of the opened device, and let
> > > the block script check whether that's correct or not.  That implies
> > > less code in the kernel side, and doesn't pollute xenstore.
> > 
> > This would require that blkback delay exposing the device to the
> > frontend until the block script has checked that the diskseq is correct.
> 
> This depends on your toolstack implementation.  libxl won't start the
> domain until block scripts have finished execution, and hence the
> block script waiting for the sysfs node to appear and check it against
> the expected value would be enough.

True, but we cannot assume that everyone is using libxl.

> > Much simpler for the block script to provide the diskseq in xenstore.
> > If you want to avoid an extra xenstore node, I can make the diskseq part
> > of the physical-device node.
> 
> I'm thinking that we might want to introduce a "physical-device-uuid"
> node and use that to provide the diskseq to the backened.  Toolstacks
> (or block scripts) would need to be sure the "physical-device-uuid"
> node is populated before setting "physical-device", as writes to
> that node would still trigger blkback watch.  I think using two
> distinct watches would just make the logic in blkback too
> complicated.
> 
> My preference would be for the kernel to have a function for opening a
> device identified by a diskseq (as fetched from
> "physical-device-uuid"), so that we don't have to open using
> major:minor and then check the diskseq.

In theory I agree, but in practice it would be a significantly more
complex patch and given that it does not impact the uAPI I would prefer
the less-invasive option.  Is there anything more that needs to be done
here, other than replacing the "diskseq" name?  I prefer
"physical-device-luid" because the ID is only valid in one particular
VM.

Roger Pau Monné June 21, 2023, 10:07 a.m. UTC | #11

On Tue, Jun 20, 2023 at 09:14:25PM -0400, Demi Marie Obenour wrote:
> On Mon, Jun 12, 2023 at 10:09:39AM +0200, Roger Pau Monné wrote:
> > On Fri, Jun 09, 2023 at 12:55:39PM -0400, Demi Marie Obenour wrote:
> > > On Fri, Jun 09, 2023 at 05:13:45PM +0200, Roger Pau Monné wrote:
> > > > On Thu, Jun 08, 2023 at 11:33:26AM -0400, Demi Marie Obenour wrote:
> > > > > On Thu, Jun 08, 2023 at 10:29:18AM +0200, Roger Pau Monné wrote:
> > > > > > On Wed, Jun 07, 2023 at 12:14:46PM -0400, Demi Marie Obenour wrote:
> > > > > > > On Wed, Jun 07, 2023 at 10:20:08AM +0200, Roger Pau Monné wrote:
> > > > > > Then the block script will open the device by diskseq and pass the
> > > > > > major:minor numbers to blkback.
> > > > > 
> > > > > Alternatively, the toolstack could write both the diskseq and
> > > > > major:minor numbers and be confident that it is referring to the
> > > > > correct device, no matter how long ago it got that information.
> > > > > This could be quite useful for e.g. one VM exporting a device to
> > > > > another VM by calling losetup(8) and expecting a human to make a
> > > > > decision based on various properties about the device.  In this
> > > > > case there is no upper bound on the race window.
> > > > 
> > > > Instead of playing with xenstore nodes, it might be better to simply
> > > > have blkback export on sysfs the diskseq of the opened device, and let
> > > > the block script check whether that's correct or not.  That implies
> > > > less code in the kernel side, and doesn't pollute xenstore.
> > > 
> > > This would require that blkback delay exposing the device to the
> > > frontend until the block script has checked that the diskseq is correct.
> > 
> > This depends on your toolstack implementation.  libxl won't start the
> > domain until block scripts have finished execution, and hence the
> > block script waiting for the sysfs node to appear and check it against
> > the expected value would be enough.
> 
> True, but we cannot assume that everyone is using libxl.

Right, for the udev case this won't be good, since the domain could be
already running, and hence could potentially attach to the backend
before the hotplug script realized the opened device is wrong.
Likewise for hot add disks.

> > > Much simpler for the block script to provide the diskseq in xenstore.
> > > If you want to avoid an extra xenstore node, I can make the diskseq part
> > > of the physical-device node.
> > 
> > I'm thinking that we might want to introduce a "physical-device-uuid"
> > node and use that to provide the diskseq to the backened.  Toolstacks
> > (or block scripts) would need to be sure the "physical-device-uuid"
> > node is populated before setting "physical-device", as writes to
> > that node would still trigger blkback watch.  I think using two
> > distinct watches would just make the logic in blkback too
> > complicated.
> > 
> > My preference would be for the kernel to have a function for opening a
> > device identified by a diskseq (as fetched from
> > "physical-device-uuid"), so that we don't have to open using
> > major:minor and then check the diskseq.
> 
> In theory I agree, but in practice it would be a significantly more
> complex patch and given that it does not impact the uAPI I would prefer
> the less-invasive option.

From a blkback point of view I don't see that option as more invasive,
it's actually the other way around IMO.  On blkback we would use
blkdev_get_by_diskseq() (or equivalent) instead of
blkdev_get_by_dev(), so it would result in an overall simpler
change (because the check against diskseq won't be needed anymore).

> Is there anything more that needs to be done
> here, other than replacing the "diskseq" name?

I think we also spoke about using sscanf to parse the option.

The patch to Xen blkif.h needs to be accepted before the Linux side
can progress.


> I prefer
> "physical-device-luid" because the ID is only valid in one particular
> VM.

"physical-device-uid" then maybe?

Thanks, Roger.

[v2,13/16] xen-blkback: Implement diskseq checks

Commit Message

Comments

Patch