
[RESEND] tools/libxl: add support for emulated NVMe drives

Message ID 1490188195-25209-1-git-send-email-paul.durrant@citrix.com (mailing list archive)
State New, archived

Commit Message

Paul Durrant March 22, 2017, 1:09 p.m. UTC
Upstream QEMU supports emulation of NVM Express (a.k.a. NVMe) drives.

This patch adds a new vdev type into libxl to allow such drives to be
presented to HVM guests. Because the purpose of the new vdev is purely
to configure emulation, the syntax only supports specification of
whole disks. Also there is no need to introduce a new concrete VBD
encoding for NVMe drives.

NOTE: QEMU's emulation only supports a single NVMe namespace, so the
      vdev syntax does not include specification of a namespace.
      Also, current versions of SeaBIOS do not support booting from
      NVMe devices, so the vdev should only be used for secondary
      drives.
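
A hypothetical xl configuration fragment using the new vdev might look
like this (the paths and image formats are illustrative assumptions;
the boot disk still needs an hd*/xvd* vdev since SeaBIOS cannot boot
from NVMe):

    disk = [ 'format=raw, vdev=hda, target=/path/to/boot.img',
             'format=qcow2, vdev=nvme0, target=/path/to/data.qcow2' ]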

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>

NOTE FOR COMMITTERS:

Support for unplug of NVMe drives was added into QEMU by the following
commit:

http://git.qemu-project.org/?p=qemu.git;a=commit;h=090fa1c8
---
 docs/man/xen-vbd-interface.markdown.7 | 15 ++++++++-------
 docs/man/xl-disk-configuration.pod.5  |  4 ++--
 tools/libxl/libxl_device.c            |  8 ++++++++
 tools/libxl/libxl_dm.c                |  6 ++++++
 4 files changed, 24 insertions(+), 9 deletions(-)

Comments

Ian Jackson March 22, 2017, 2:16 p.m. UTC | #1
Paul Durrant writes ("[PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> NOTE: QEMU's emulation only supports a single NVMe namespace, so the
>       vdev syntax does not include specification of a namespace.
>       Also, current versions of SeaBIOS do not support booting from
>       NVMe devices, so the vdev should only be used for secondary
>       drives.

If and when qemu supports multiple namespaces, how will this be
specified ?

Ian.
Paul Durrant March 22, 2017, 2:22 p.m. UTC | #2
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 14:17
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: Re: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("[PATCH RESEND] tools/libxl: add support for emulated
> NVMe drives"):
> > NOTE: QEMU's emulation only supports a single NVMe namespace, so the
> >       vdev syntax does not include specification of a namespace.
> >       Also, current versions of SeaBIOS do not support booting from
> >       NVMe devices, so the vdev should only be used for secondary
> >       drives.
> 
> If and when qemu supports multiple namespaces, how will this be
> specified ?

I'd guess a vdev of the form nvmeXnY would follow convention.

  Paul

> 
> Ian.
Ian Jackson March 22, 2017, 3:01 p.m. UTC | #3
Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> [Ian:]
> > If and when qemu supports multiple namespaces, how will this be
> > specified ?
> 
> I'd guess a vdev of the form nvmeXnY would follow convention.

Right.  Sorry, I was unclear.  I meant here:

 +    1 << 28 | disk << 8                  nvme, all disks, whole disk only

Maybe we should leave some room in this bit pattern.  If we do that
then guests only need to be told about this once, even if we support
nvme0n2 or whatever later, and also we don't need to use more
numbering space.


Oh, wait, I have just noticed that you have reused an entry in this
table!

       1 << 28 | disk << 8 | partition   xvd, disks or partitions 16 onwards
...
> +    1 << 28 | disk << 8               nvme, all disks, whole disk only

(I didn't spot this before because the patch didn't contain the first
of those two lines as context.)
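
To make the overlap concrete, take disk 3 as a worked example; both
rows of the table decode to exactly the same number:

    xvd (extended encoding), disk 3, whole disk : (1 << 28) | (3 << 8) | 0 = 0x10000300
    nvme, disk 3 (proposed row)                 : (1 << 28) | (3 << 8)     = 0x10000300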


Ian.
Paul Durrant March 22, 2017, 3:21 p.m. UTC | #4
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 15:02
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives"):
> > [Ian:]
> > > If and when qemu supports multiple namespaces, how will this be
> > > specified ?
> >
> > I'd guess a vdev of the form nvmeXnY would follow convention.
> 
> Right.  Sorry, I was unclear.  I meant here:
> 
>  +    1 << 28 | disk << 8                  nvme, all disks, whole disk only
> 
> Maybe we should leave some room in this bit pattern.  If we do that
> then guests only need to be told about this once, even if we support
> nvme0n2 or whatever later, and also we don't need to use more
> numbering space.
> 
> 
> Oh, wait, I have just noticed that you have reused an entry in this
> table!
> 
>        1 << 28 | disk << 8 | partition   xvd, disks or partitions 16 onwards
> ...
> > +    1 << 28 | disk << 8               nvme, all disks, whole disk only
> 
> (I didn't spot this before because the patch didn't contain the first
> of those two lines as context.)
> 

Yes, that was intentional. No need for a new concrete encoding if we ignore namespaces, as I said in the commit comment. If you want to support namespaces then it would need something new... which doesn't seem worth it since QEMU has no support.

  Paul

> 
> Ian.
Ian Jackson March 22, 2017, 4:03 p.m. UTC | #5
Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> > Oh, wait, I have just noticed that you have reused an entry in this
> > table!
> > 
> >        1 << 28 | disk << 8 | partition   xvd, disks or partitions 16 onwards
> > ...
> > > +    1 << 28 | disk << 8               nvme, all disks, whole disk only
> 
> Yes, that was intentional. No need for a new concrete encoding if we
> ignore namespaces, as I said in the commit comment. If you want to
> support namespaces then it would need something new... which doesn't
> seem worth it since QEMU has no support.

That's not my point.  The purpose of this table is to advise guests
what the conventional in-guest device name ought to be for a certain
vbd.

We can't have one single numerical encoding mapping to two device
names.  That makes no sense.

Presumably these NVME devices should be subject to the same vbd and
unplug approach as scsi and ide disks.  In which case guests need to
be told that the device name for the vbd should be "nvme<something>",
which can only be done by using a different number.

Ian.
Paul Durrant March 22, 2017, 4:31 p.m. UTC | #6
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 16:03
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives"):
> > > Oh, wait, I have just noticed that you have reused an entry in this
> > > table!
> > >
> > >        1 << 28 | disk << 8 | partition   xvd, disks or partitions 16 onwards
> > > ...
> > > > +    1 << 28 | disk << 8               nvme, all disks, whole disk only
> >
> > Yes, that was intentional. No need for a new concrete encoding if we
> > ignore namespaces, as I said in the commit comment. If you want to
> > support namespaces then it would need something new... which doesn't
> > seem worth it since QEMU has no support.
> 
> That's not my point.  The purpose of this table is to advise guests
> what the conventional in-guest device name ought to be for a certain
> vbd.

Yes, and xvd<something> is a perfectly fine name for a PV device in pretty much every case. It's already the case that emulated IDE disks are exposed to guests using xvd* numbering.

> 
> We can't have one single numerical encoding mapping to two device
> names.  That makes no sense.
> 
> Presumably these NVME devices should be subject to the same vbd and
> unplug approach as scsi and ide disks.

Yes, that's what the QEMU patch does.

> In which case guests need to
> be told that the device name for the vbd should be "nvme<something>",
> which can only be done by using a different number.
> 

That means modifications to PV frontends would be needed, which is going to make things more difficult. Most OS find disks by UUID these days anyway so I'm still not sure that just using xvd* numbering would really be a problem.

  Paul

> Ian.
Paul Durrant March 22, 2017, 4:45 p.m. UTC | #7
> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Paul Durrant
> Sent: 22 March 2017 16:32
> To: Ian Jackson <Ian.Jackson@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: Re: [Xen-devel] [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives
> 
> > -----Original Message-----
> > From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> > Sent: 22 March 2017 16:03
> > To: Paul Durrant <Paul.Durrant@citrix.com>
> > Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> > Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> > drives
> >
> > Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> > emulated NVMe drives"):
> > > > Oh, wait, I have just noticed that you have reused an entry in this
> > > > table!
> > > >
> > > >        1 << 28 | disk << 8 | partition   xvd, disks or partitions 16 onwards
> > > > ...
> > > > > +    1 << 28 | disk << 8               nvme, all disks, whole disk only
> > >
> > > Yes, that was intentional. No need for a new concrete encoding if we
> > > ignore namespaces, as I said in the commit comment. If you want to
> > > support namespaces then it would need something new... which doesn't
> > > seem worth it since QEMU has no support.
> >
> > That's not my point.  The purpose of this table is to advise guests
> > what the conventional in-guest device name ought to be for a certain
> > vbd.
> 
> Yes, and xvd<something> is a perfectly fine name for a PV device in pretty
> much every case. It's already the case that emulated IDE disks are exposed to
> guests using xvd* numbering.
> 
> >
> > We can't have one single numerical encoding mapping to two device
> > names.  That makes no sense.
> >
> > Presumably these NVME devices should be subject to the same vbd and
> > unplug approach as scsi and ide disks.
> 
> Yes, that's what the QEMU patch does.
> 
> > In which case guests need to
> > be told that the device name for the vbd should be "nvme<something>",
> > which can only be done by using a different number.
> >
> 
> That means modifications to PV frontends would be needed, which is going
> to make things more difficult. Most OS find disks by UUID these days anyway
> so I'm still not sure that just using xvd* numbering would really be a problem.
> 

Also, if we're going to go for distinct numbering then I guess the QEMU patch is probably wrong, because we'd want guests with a non-aware PV frontend to leave emulated NVMe devices plugged in.

  Paul

>   Paul
> 
> > Ian.
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
Ian Jackson March 22, 2017, 5:02 p.m. UTC | #8
Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> > From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> > That's not my point.  The purpose of this table is to advise guests
> > what the conventional in-guest device name ought to be for a certain
> > vbd.
> 
> Yes, and xvd<something> is a perfectly fine name for a PV device in pretty much every case. It's already the case that emulated IDE disks are exposed to guests using xvd* numbering.

No, I don't think so:

/libxl/5/device/vbd/5632/params = "aio:/root/68254.test-amd64-amd64-xl-qemuu-debianhvm-amd64.debianhvm-em\..."
(n0)

5632 = 22 << 8 | 0 ie "hd, disk 2, partition 0"

Some operating systems (including many recent Linux kernels) present
all vbds as xvd*.

> > Presumably these NVME devices should be subject to the same vbd and
> > unplug approach as scsi and ide disks.
> 
> Yes, that's what the QEMU patch does.

So maybe they should reuse the hd* numbering ?

> That means modifications to PV frontends would be needed, which is
> going to make things more difficult. Most OS find disks by UUID
> these days anyway so I'm still not sure that just using xvd*
> numbering would really be a problem.

In terms of the "nominal disk type" discussed in
xen-vbd-interface.markdown.7, I don't think these emulated devices,
which get unplugged, should have a "nominal disk type" of "Xen
virtual disk".

Ian.
Paul Durrant March 22, 2017, 5:16 p.m. UTC | #9
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 17:03
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives"):
> > > From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> > > That's not my point.  The purpose of this table is to advise guests
> > > what the conventional in-guest device name ought to be for a certain
> > > vbd.
> >
> > Yes, and xvd<something> is a perfectly fine name for a PV device in pretty
> much every case. It's already the case that emulated IDE disks are exposed to
> guests using xvd* numbering.
> 
> No, I don't think so:
> 
> /libxl/5/device/vbd/5632/params = "aio:/root/68254.test-amd64-amd64-xl-
> qemuu-debianhvm-amd64.debianhvm-em\..."
> (n0)
> 
> 5632 = 22 << 8 | 0 ie "hd, disk 2, partition 0"
> 

This is my VM:

root@brixham:~# xenstore-ls "/libxl/3"
device = ""
 vbd = ""
  51712 = ""
   frontend = "/local/domain/3/device/vbd/51712"
   backend = "/local/domain/0/backend/qdisk/3/51712"
   params = "qcow2:/root/winrs2-pv1.qcow2"
   frontend-id = "3"
   online = "1"
   removable = "0"
   bootable = "1"
   state = "1"
   dev = "xvda"
   type = "qdisk"
   mode = "w"
   device-type = "disk"
   discard-enable = "1"

No problem using xvda... still ends up as IDE primary master.

> Some operating systems (including many recent Linux kernels) present
> all vbds as xvd*.
> 
> > > Presumably these NVME devices should be subject to the same vbd and
> > > unplug approach as scsi and ide disks.
> >
> > Yes, that's what the QEMU patch does.
> 
> So maybe they should reuse the hd* numbering ?
> 

That might be too limiting. The hd* numbering scheme doesn't stretch very far.

> > That means modifications to PV frontends would be needed, which is
> > going to make things more difficult. Most OS find disks by UUID
> > these days anyway so I'm still not sure that just using xvd*
> > numbering would really be a problem.
> 
> In terms of the "nominal disk type" discussed in
> xen-vbd-interface.markdown.7, I don't think these emulated devices,
> which get unplugged, should have a "nominal disk type" of "Xen
> virtual disk".
> 

Ok. I'll submit another patch to QEMU to distinguish between IDE/SCSI disks and NVMe disks in the unplug protocol, come up with a new PV numbering scheme and modify the Windows frontend to understand it.

  Paul

> Ian.
Ian Jackson March 22, 2017, 5:31 p.m. UTC | #10
Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> This is my VM:
> 
> root@brixham:~# xenstore-ls "/libxl/3"
> device = ""
>  vbd = ""
>   51712 = ""
...
>    params = "qcow2:/root/winrs2-pv1.qcow2"

> No problem using xvda... still ends up as IDE primary master.

Right.  The question is more whether this confuses the guest.  I don't
think the tools will actually mind.

I guess that was with xapi rather than libxl ?

> > So maybe they should reuse the hd* numbering ?
> 
> That might be too limiting. The hd* numbering scheme doesn't stretch
> very far.

Indeed.  sd is rather limited too.

But, you say:

      Also, current versions of SeaBIOS do not support booting from
      NVMe devices, so the vdev should only be used for secondary drives.

So currently this is mostly useful for testing ?

Normally the emulated devices are _intended_ for bootstrapping to an
environment that can handle vbds.  Which doesn't involve having very
many of them.

> > > That means modifications to PV frontends would be needed, which is
> > > going to make things more difficult. Most OS find disks by UUID
> > > these days anyway so I'm still not sure that just using xvd*
> > > numbering would really be a problem.
> > 
> > In terms of the "nominal disk type" discussed in
> > xen-vbd-interface.markdown.7, I don't think these emulated devices,
> > which get unplugged, should have a "nominal disk type" of "Xen
> > virtual disk".
> 
> Ok. I'll submit another patch to QEMU to distinguish between
> IDE/SCSI disks and NVMe disks in the unplug protocol, come up with a
> new PV numbering scheme and modify the Windows frontend to
> understand it.

Before you go away and do a lot of work, perhaps we should keep
exploring whether my concerns are actually justified...

Allocating a new numbering scheme might involve changing Linux guests
too.  (I haven't experimented with what happens if one specifies a
reserved number.)

Ian.
Paul Durrant March 22, 2017, 5:41 p.m. UTC | #11
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 17:32
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives"):
> > This is my VM:
> >
> > root@brixham:~# xenstore-ls "/libxl/3"
> > device = ""
> >  vbd = ""
> >   51712 = ""
> ...
> >    params = "qcow2:/root/winrs2-pv1.qcow2"
> 
> > No problem using xvda... still ends up as IDE primary master.
> 
> Right.  The question is more whether this confuses the guest.  I don't
> think the tools will actually mind.
> 
> I guess that was with xapi rather than libxl ?

Nope. It was libxl.

Windows PV drivers treat hd*, sd* and xvd* numbering in the same way... they just parse the disk number out and use that as the target number of the synthetic SCSI bus exposed to Windows.
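
As a rough illustration of that parsing, here is a minimal C sketch
based on the encoding table in xen-vbd-interface.markdown.7 (this is
not the actual Windows PV driver code; partition handling is omitted
because only the disk index matters here):

    #include <stdint.h>

    /* Decode a Xen VBD number into a disk index, per the table in
     * xen-vbd-interface.markdown.7.  Returns -1 for reserved or
     * deprecated encodings. */
    static int vbd_to_disk_index(uint32_t vbd)
    {
        if (vbd & (1u << 28))              /* extended encoding */
            return (vbd >> 8) & 0xfffff;

        switch (vbd >> 8) {
        case 202:                          /* xvd, disks/partitions up to 15 */
        case 8:                            /* sd, disks/partitions up to 15 */
            return (vbd >> 4) & 0xf;
        case 3:                            /* hd, disks 0..1, partitions 0..63 */
            return (vbd >> 6) & 0x3;
        case 22:                           /* hd, disks 2..3, partitions 0..63 */
            return ((vbd >> 6) & 0x3) + 2;
        default:
            return -1;
        }
    }

For example, 51712 (202 << 8, i.e. xvda) and 768 (3 << 8, i.e. hda) both
decode to disk index 0, which is why the frontend can treat hd*, sd* and
xvd* numbering uniformly.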

> 
> > > So maybe they should reuse the hd* numbering ?
> >
> > That might be too limiting. The hd* numbering scheme doesn't stretch
> > very far.
> 
> Indeed.  sd is rather limited too.
> 
> But, you say:
> 
>       Also, current versions of SeaBIOS do not support booting from
>       NVMe devices, so the vdev should only be used for secondary drives.
> 
> So currently this is mostly useful for testing ?

Yes. Just for testing at the moment.

> 
> Normally the emulated devices are _intended_ for bootstrapping to an
> environment that can handle vbds.  Which doesn't involve having very
> many of them.
> 
> > > > That means modifications to PV frontends would be needed, which is
> > > > going to make things more difficult. Most OS find disks by UUID
> > > > these days anyway so I'm still not sure that just using xvd*
> > > > numbering would really be a problem.
> > >
> > > In terms of the "nominal disk type" discussed in
> > > xen-vbd-interface.markdown.7, I don't think these emulated devices,
> > > which get unplugged, should have a "nominal disk type" of "Xen
> > > virtual disk".
> >
> > Ok. I'll submit another patch to QEMU to distinguish between
> > IDE/SCSI disks and NVMe disks in the unplug protocol, come up with a
> > new PV numbering scheme and modify the Windows frontend to
> > understand it.
> 
> Before you go away and do a lot of work, perhaps we should keep
> exploring whether my concerns are actually justified...
> 
> Allocating a new numbering scheme might involve changing Linux guests
> too.  (I haven't experimented with what happens if one specifies a
> reserved number.)
> 

Yes, that's a point. IIRC the doc does say that guests should ignore numbers they don't understand... but who knows if this is actually the case.

Given that there's no booting from NVMe at the moment, even HVM Linux will only ever see the PV device, since the emulated device will be unplugged early in boot and PV drivers are 'in box' in Linux. Windows is really the concern, where PV drivers are installed after the OS has seen the emulated device, and thus the PV device needs to appear with the same 'identity' as far as the storage stack is concerned. I'm pretty sure this worked when I tried it a few months back using xvd* numbering (while coming up with the QEMU patch) but I'll check again.

  Paul

> Ian.
Ian Jackson March 22, 2017, 5:48 p.m. UTC | #12
Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe drives"):
> > I guess that was with xapi rather than libxl ?
> 
> Nope. It was libxl.

That's weird.  You specify it as xvda in the config file ?

> Windows PV drivers treat hd*, sd* and xvd* numbering in the same way... they just parse the disk number out and use that as the target number of the synthetic SCSI bus exposed to Windows.

What if there's an hda /and/ an xvda ?

> > Allocating a new numbering scheme might involve changing Linux guests
> > too.  (I haven't experimented with what happens if one specifies a
> > reserved number.)
> 
> Yes, that's a point. IIRC the doc does say that guests should ignore numbers they don't understand... but who knows if this is actually the case.
> 
> Given that there's no booting from NVMe at the moment, even HVM linux will only ever see the PV device since the emulated device will be unplugged early in boot and PV drivers are 'in box' in Linux. Windows is really the concern, where PV drivers are installed after the OS has seen the emulated device and thus the PV device needs to appear with the same 'identity' as far as the storage stack is concerned. I'm pretty sure this worked when I tried it a few months back using xvd* numbering (while coming up with the QEMU patch) but I'll check again.

I guess I'm trying to look forward to a "real" use case, which is
presumably emulated NVME booting ?

If it's just for testing we might not care about a low limit on the
number of devices, or the precise unplug behaviour.  Or we might
tolerate having such tests require special configuration.

Ian.
Paul Durrant March 23, 2017, 8:55 a.m. UTC | #13
> -----Original Message-----
> From: Ian Jackson [mailto:ian.jackson@eu.citrix.com]
> Sent: 22 March 2017 17:48
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: xen-devel@lists.xenproject.org; Wei Liu <wei.liu2@citrix.com>
> Subject: RE: [PATCH RESEND] tools/libxl: add support for emulated NVMe
> drives
> 
> Paul Durrant writes ("RE: [PATCH RESEND] tools/libxl: add support for
> emulated NVMe drives"):
> > > I guess that was with xapi rather than libxl ?
> >
> > Nope. It was libxl.
> 
> That's weird.  You specify it as xvda in the config file ?
> 

Yes. You'd think that specifying xvda would mean no emulated device but, no, it appears as IDE.

> > Windows PV drivers treat hd*, sd* and xvd* numbering in the same way...
> they just parse the disk number out and use that as the target number of the
> synthetic SCSI bus exposed to Windows.
> 
> What if there's an hda /and/ and an xvda ?
> 

libxl: error: libxl_dm.c:2345:device_model_spawn_outcome: Domain 1:domain 1 device model: spawn failed (rc=-3)
libxl: error: libxl_create.c:1493:domcreate_devmodel_started: Domain 1:device model did not start: -3
libxl: error: libxl_dm.c:2459:kill_device_model: Device Model already exited
libxl: error: libxl_dom.c:38:libxl__domain_type: unable to get domain type for domid=1
libxl: error: libxl_domain.c:962:domain_destroy_callback: Domain 1:Unable to destroy guest
libxl: error: libxl_domain.c:889:domain_destroy_cb: Domain 1:Destruction of domain failed
root@brixham:~# tail -F /var/log/xen/qemu-dm-winrs2-1.hvm.log
qemu-system-i386:/root/events:12: WARNING: trace event 'xen_domid_restrict' does not exist
qemu-system-i386: -drive file=/root/disk.qcow2,if=ide,index=0,media=disk,format=qcow2,cache=writeback: drive with bus=0, unit=0 (index=0) exists

However, no such failure occurs if I choose 'nvme0' for my secondary disk, so it is unsafe to re-use xvd* numbering without at least further modification to libxl to make sure that there is only ever one disk N, whatever numbering scheme is used.

> > > Allocating a new numbering scheme might involve changing Linux guests
> > > too.  (I haven't experimented with what happens if one specifies a
> > > reserved number.)
> >
> > Yes, that's a point. IIRC the doc does say that guests should ignore numbers
> they don't understand... but who knows if this is actually the case.
> >
> > Given that there's no booting from NVMe at the moment, even HVM linux
> will only ever see the PV device since the emulated device will be unplugged
> early in boot and PV drivers are 'in box' in Linux. Windows is really the
> concern, where PV drivers are installed after the OS has seen the emulated
> device and thus the PV device needs to appear with the same 'identity' as far
> as the storage stack is concerned. I'm pretty sure this worked when I tried it a
> few months back using xvd* numbering (while coming up with the QEMU
> patch) but I'll check again.
> 
> I guess I'm trying to look forward to a "real" use case, which is
> presumably emulated NVME booting ?
> 
> If it's just for testing we might not care about a low limit on the
> number of devices, or the precise unplug behaviour.  Or we might
> tolerate having such tests require special configuration.
> 

The potential use for NVMe in the long run is actually to avoid using PV at all. QEMU's emulation of NVMe is not as fast as QEMU acting as a PV backend, but it's not that far off, and the advantage is that NVMe is a standard and thus Windows has an in-box driver. So, having thought more about it, we definitely should separate NVMe devices from IDE/SCSI devices in the unplug protocol and - should a PV frontend choose to displace emulated NVMe - it does indeed need to be able to distinguish them.
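
A minimal sketch of the guest-side unplug request, assuming the flag
values documented for the existing protocol (IDE disks, NICs, aux IDE
disks) and a hypothetical UNPLUG_NVME_DISKS bit standing in for whatever
value the revised QEMU protocol ends up defining:

    #include <stdint.h>

    #define XEN_UNPLUG_IOPORT      0x10   /* I/O port used by the unplug protocol */
    #define UNPLUG_ALL_IDE_DISKS   0x01
    #define UNPLUG_ALL_NICS        0x02
    #define UNPLUG_AUX_IDE_DISKS   0x04
    #define UNPLUG_NVME_DISKS      0x08   /* hypothetical: not yet part of the protocol */

    /* Platform-specific 16-bit port write (e.g. outw() on x86). */
    extern void port_write16(uint16_t port, uint16_t value);

    /* An NVMe-aware PV frontend asks for emulated NVMe drives to be
     * unplugged as well; an older frontend that only sets
     * UNPLUG_ALL_IDE_DISKS leaves them plugged in, as discussed above. */
    static void unplug_emulated_disks(int nvme_aware)
    {
        uint16_t flags = UNPLUG_ALL_IDE_DISKS;

        if (nvme_aware)
            flags |= UNPLUG_NVME_DISKS;

        port_write16(XEN_UNPLUG_IOPORT, flags);
    }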

I'll post a patch to QEMU today to revise the unplug protocol and I'll check what happens when blkfront encounters a vbd number it doesn't understand.

  Paul

> Ian.

Patch

diff --git a/docs/man/xen-vbd-interface.markdown.7 b/docs/man/xen-vbd-interface.markdown.7
index 1c996bf..8fd378c 100644
--- a/docs/man/xen-vbd-interface.markdown.7
+++ b/docs/man/xen-vbd-interface.markdown.7
@@ -8,12 +8,12 @@  emulated IDE, AHCI or SCSI disks.
 The abstract interface involves specifying, for each block device:
 
  * Nominal disk type: Xen virtual disk (aka xvd*, the default); SCSI
-   (sd*); IDE or AHCI (hd*).
+   (sd*); IDE or AHCI (hd*); NVMe.
 
-   For HVM guests, each whole-disk hd* and and sd* device is made
-   available _both_ via emulated IDE resp. SCSI controller, _and_ as a
-   Xen VBD.  The HVM guest is entitled to assume that the IDE or SCSI
-   disks available via the emulated IDE controller target the same
+   For HVM guests, each whole-disk hd*, sd* or nvme* device is made
+   available _both_ via emulated IDE, SCSI controller or NVMe drive
+   respectively _and_ as a Xen VBD.  The HVM guest is entitled to
+   assume that the disks available via the emulation target the same
    underlying devices as the corresponding Xen VBD (ie, multipath).
    In hd* case with hdtype=ahci, disk will be AHCI via emulated
    ich9 disk controller.
@@ -42,8 +42,7 @@  The abstract interface involves specifying, for each block device:
    treat each vbd as it would a partition or slice or LVM volume (for
    example by putting or expecting a filesystem on it).
 
-   Non-whole disk devices cannot be passed through to HVM guests via
-   the emulated IDE or SCSI controllers.
+   Only whole disk devices can be emulated for HVM guests.
 
 
 Configuration file syntax
@@ -56,6 +55,7 @@  The config file syntaxes are, for example
        d536p37  xvdtq37  Xen virtual disk 536 partition 37
        sdb3              SCSI disk 1 partition 3
        hdc2              IDE disk 2 partition 2
+       nvme0             NVMe disk 0 (whole disk only)
 
 The d*p* syntax is not supported by xm/xend.
 
@@ -78,6 +78,7 @@  encodes the information above as follows:
      8 << 8 | disk << 4 | partition      sd, disks and partitions up to 15
      3 << 8 | disk << 6 | partition      hd, disks 0..1, partitions 0..63
     22 << 8 | (disk-2) << 6 | partition  hd, disks 2..3, partitions 0..63
+    1 << 28 | disk << 8                  nvme, all disks, whole disk only
     2 << 28 onwards                      reserved for future use
    other values less than 1 << 28        deprecated / reserved
 
diff --git a/docs/man/xl-disk-configuration.pod.5 b/docs/man/xl-disk-configuration.pod.5
index d3eedc1..c40418e 100644
--- a/docs/man/xl-disk-configuration.pod.5
+++ b/docs/man/xl-disk-configuration.pod.5
@@ -127,8 +127,8 @@  designation in some specifications).  L<xen-vbd-interface(7)>
 
 =item Supported values
 
-hd[x], xvd[x], sd[x] etc.  Please refer to the above specification for
-further details.
+hd[x], xvd[x], sd[x], nvme[x] etc.  Please refer to the above specification
+for further details.
 
 =item Deprecated values
 
diff --git a/tools/libxl/libxl_device.c b/tools/libxl/libxl_device.c
index 5e96676..bd06904 100644
--- a/tools/libxl/libxl_device.c
+++ b/tools/libxl/libxl_device.c
@@ -532,6 +532,14 @@  int libxl__device_disk_dev_number(const char *virtpath, int *pdisk,
         if (ppartition) *ppartition = partition;
         return (8 << 8) | (disk << 4) | partition;
     }
+    if (!memcmp(virtpath, "nvme", 4)) {
+        disk = strtoul(virtpath + 4, &ep, 10);
+        if (*ep)
+            return -1;
+        if (pdisk) *pdisk = disk;
+        if (ppartition) *ppartition = 0;
+        return (1 << 28) | (disk << 8);
+    }
     return -1;
 }
 
diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index 4344c53..9efb4b7 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1568,6 +1568,12 @@  static int libxl__build_device_model_args_new(libxl__gc *gc,
                                                         format,
                                                         &disks[i],
                                                         colo_mode);
+                } else if (strncmp(disks[i].vdev, "nvme", 4) == 0) {
+                    flexarray_vappend(dm_args,
+                        "-drive",  GCSPRINTF("file=%s,if=none,id=nvmedisk-%d,format=%s,cache=writeback", target_path, disk, format),
+                        "-device", GCSPRINTF("nvme,drive=nvmedisk-%d,serial=%d", disk, disk),
+                        NULL);
+                    continue;
                 } else if (disk < 6 && b_info->u.hvm.hdtype == LIBXL_HDTYPE_AHCI) {
                     if (!disks[i].readwrite) {
                         LOGD(ERROR, guest_domid,
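
For a hypothetical disk specified as vdev=nvme0 with a qcow2 image, the
flexarray additions above expand to QEMU arguments along these lines
(the image path is illustrative):

    -drive file=/path/to/data.qcow2,if=none,id=nvmedisk-0,format=qcow2,cache=writeback
    -device nvme,drive=nvmedisk-0,serial=0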