
[RFC,00/39] x86/KVM: Xen HVM guest support

Message ID 20190220201609.28290-1-joao.m.martins@oracle.com (mailing list archive)
Series x86/KVM: Xen HVM guest support

Message

Joao Martins Feb. 20, 2019, 8:15 p.m. UTC
Hey,

Presented herewith is a series that allows KVM to boot Xen x86 HVM guests (with
their respective frontends and backends). On the hypervisor side, the approach is
to keep the implementation similar to how Hyper-V support was done on x86 KVM. On
the backend driver side, the intent is to reuse the existing Xen support.

Note that this is an RFC, so there are bugs and room for improvement. The intent
is to get overall feedback before proceeding further.

Running Xen guests on KVM enables the following use cases:

 * Run unmodified Xen HVM images;

 * Facilitate development/testing of Xen guests and Xen PV drivers;

A similar proposal, Xenner, was made in the past; this work has the following
advantages over it:

 * Allows use of existing Xen PV drivers such as block, net etc, both as
 frontends and backends;

 * Xen tooling will see the same UABI as on Xen. This means things like
 xenstored, xenstore-list and xenstore-read run unmodified. Optionally,
 the userspace VMM can emulate xenstore operations;

 This work is divided into two parts:

 1. Xen HVM ABI support (patches 1 - 16)

 Support the necessary mechanisms to allow HVM guests to
 boot *without* PV frontends exposed to the guest.
 
 We start by intercepting hypercalls made by the guest, followed
 by event channel IPIs/VIRQs (for PV IPI, timers, spinlocks),
 pvclock and steal clock.
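
 To make the hypercall interception step a bit more concrete, below is a
 small illustrative sketch (not code from the series) of how a hypervisor
 or VMM can populate the Xen hypercall page that the guest registers via
 the hypercall-page MSR. Each 32-byte slot gets a stub that loads the
 hypercall number and executes VMCALL (VMMCALL on AMD), which KVM then
 intercepts and can route to a Xen-specific hypercall handler:

#include <stdint.h>
#include <string.h>

#define HCALL_SLOT_SIZE 32
#define NR_HYPERCALLS   128

/* Fill one stub per hypercall: mov $nr, %eax; vmcall; ret */
static void fill_xen_hypercall_page(uint8_t *page)
{
	int i;

	for (i = 0; i < NR_HYPERCALLS; i++) {
		uint8_t *stub = page + i * HCALL_SLOT_SIZE;
		uint32_t nr = i;

		stub[0] = 0xb8;				/* mov imm32, %eax */
		memcpy(&stub[1], &nr, sizeof(nr));
		stub[5] = 0x0f;				/* vmcall */
		stub[6] = 0x01;
		stub[7] = 0xc1;
		stub[8] = 0xc3;				/* ret */
	}
}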

	Ankur Arora (1):
	      KVM: x86/xen: support upcall vector

	Boris Ostrovsky (1):
	      KVM: x86/xen: handle PV spinlocks slowpath

	Joao Martins (14):
	      KVM: x86: fix Xen hypercall page msr handling
	      KVM: x86/xen: intercept xen hypercalls if enabled
	      KVM: x86/xen: register shared_info page
	      KVM: x86/xen: setup pvclock updates
	      KVM: x86/xen: update wallclock region
	      KVM: x86/xen: register vcpu info
	      KVM: x86/xen: register vcpu time info region
	      KVM: x86/xen: register steal clock
	      KVM: x86: declare Xen HVM guest capability
	      KVM: x86/xen: evtchn signaling via eventfd
	      KVM: x86/xen: store virq when assigning evtchn
	      KVM: x86/xen: handle PV timers oneshot mode
	      KVM: x86/xen: handle PV IPI vcpu yield
	      KVM: x86: declare Xen HVM evtchn offload capability
	 
	 Documentation/virtual/kvm/api.txt |   23 ++
	 arch/x86/include/asm/kvm_host.h   |   46 ++++
	 arch/x86/kvm/Makefile             |    2 +-
	 arch/x86/kvm/irq.c                |   25 ++-
	 arch/x86/kvm/irq_comm.c           |   11 +
	 arch/x86/kvm/trace.h              |   33 +++
	 arch/x86/kvm/x86.c                |   60 +++++-
	 arch/x86/kvm/x86.h                |    1 +
	 arch/x86/kvm/xen.c                | 1025 +++++++++++++++++++++++++++++
	 arch/x86/kvm/xen.h                |   48 +++++
	 include/linux/kvm_host.h          |   24 +++
	 include/uapi/linux/kvm.h          |   73 ++++++-
	 12 files changed, 1361 insertions(+), 10 deletions(-)


 2. PV Driver support (patches 17 - 39)

 We start by redirecting hypercalls from the backend to routines
 which emulate the behaviour that PV backends expect i.e. grant
 table and interdomain events. Next, we add support for late
 initialization of xenbus, followed by implementing
 frontend/backend communication mechanisms (i.e. grant tables and
 interdomain event channels). Finally, introduce xen-shim.ko,
 which will setup a limited Xen environment. This uses the added
 functionality of Xen specific shared memory (grant tables) and
 notifications (event channels).

 Note that patch 19 is useful to Xen on its own.
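
 For a rough feel of what the grant table emulation amounts to, here is a
 heavily hedged sketch (names, structures and helpers are hypothetical, not
 the series' actual code) of a shim-side map handler. Since the backend
 already runs in the host kernel, "mapping" boils down to resolving the
 grant reference to the granted guest frame and reporting it back, so the
 backend can derive the struct page itself:

/* Assumes kernel/Xen headers for gnttab_map_grant_ref, GNTST_*, GTF_*. */
struct shim_grant_entry {
	u32 flags;		/* GTF_permit_access, etc. */
	u16 domid;		/* domain granted access */
	u64 gfn;		/* guest frame being shared */
};

static int shim_hcall_gntmap(struct kvm *kvm,
			     struct gnttab_map_grant_ref *op)
{
	/* shim_gref_lookup() is hypothetical: find the grant entry. */
	struct shim_grant_entry *ent = shim_gref_lookup(kvm, op->ref);

	if (!ent || !(ent->flags & GTF_permit_access)) {
		op->status = GNTST_bad_gntref;
		return 0;
	}

	/* No remapping: report the frame so the backend can use it. */
	op->dev_bus_addr = ent->gfn << PAGE_SHIFT;
	op->handle = op->ref;
	op->status = GNTST_okay;
	return 0;
}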

	 Ankur Arora (11):
	      x86/xen: export vcpu_info and shared_info
	      x86/xen: make hypercall_page generic
	      KVM: x86/xen: backend hypercall support
	      KVM: x86/xen: grant map support
	      KVM: x86/xen: grant unmap support
	      KVM: x86/xen: interdomain evtchn support
	      KVM: x86/xen: evtchn unmask support
	      KVM: x86/xen: add additional evtchn ops
	      xen-shim: introduce shim domain driver
	      xen/gntdev: xen_shim_domain() support
	      xen/xenbus: xen_shim_domain() support

	Joao Martins (12):
	      xen/xenbus: xenbus uninit support
	      xen-blkback: module_exit support
	      KVM: x86/xen: domid allocation
	      KVM: x86/xen: grant table init
	      KVM: x86/xen: grant table grow support
	      KVM: x86/xen: grant copy support
	      xen/balloon: xen_shim_domain() support
	      xen/grant-table: xen_shim_domain() support
	      drivers/xen: xen_shim_domain() support
	      xen-netback: xen_shim_domain() support
	      xen-blkback: xen_shim_domain() support
	      KVM: x86: declare Xen HVM Dom0 capability

	[See the entire series diffstat at the end]

There are additional QEMU patches to take advantage of this (they are
available here [0][1]). An example of how you could run it:

$ ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -serial mon:stdio \
  -machine xenfv,accel=kvm \
  -cpu host,-kvm,+xen,xen-major-version=4,xen-minor-version=4,+xen-vapic,+xen-pvclock \
  -kernel /path/to/kernel -m 16G -smp 16,sockets=1,cores=16,threads=1,maxcpus=16 \
  -append "earlyprintk=ttyS0 console=tty0 console=ttyS0" \
  -device xen-platform -device usb-ehci -device usb-tablet,bus=usb-bus.0 -vnc :0 -k pt \
  -netdev type=tap,id=net1,ifname=vif1.0,script=qemu-ifup \
  -device xen-nic,netdev=net1,mac=52:54:00:12:34:56,backendtype=vif,backend=0 \
  -drive file=/path/to/image.img,format=raw,id=legacy,if=ide \
  -blockdev file,node-name=drive,filename=/path/to/image.img,locking=off \
  -device xen-disk,vdev=xvda,drive=drive,backendtype=vbd

Naturally, other options are still at your disposal (e.g. booting with the q35
platform, or with virtio, etc.).

Thoughts? 

Cheers,
	Joao

[0] https://www.github.com/jpemartins/qemu xen-shim-rfc
[1] https://www.github.com/jpemartins/linux xen-shim-rfc

 Documentation/virtual/kvm/api.txt        |   33 +
 arch/x86/include/asm/kvm_host.h          |   88 ++
 arch/x86/include/asm/xen/hypercall.h     |   12 +-
 arch/x86/kvm/Kconfig                     |   10 +
 arch/x86/kvm/Makefile                    |    3 +-
 arch/x86/kvm/irq.c                       |   25 +-
 arch/x86/kvm/irq_comm.c                  |   11 +
 arch/x86/kvm/trace.h                     |   33 +
 arch/x86/kvm/x86.c                       |   66 +-
 arch/x86/kvm/x86.h                       |    1 +
 arch/x86/kvm/xen-asm.S                   |   66 +
 arch/x86/kvm/xen-shim.c                  |  138 ++
 arch/x86/kvm/xen.c                       | 2262 ++++++++++++++++++++++++++++++
 arch/x86/kvm/xen.h                       |   55 +
 arch/x86/xen/enlighten.c                 |   49 +
 arch/x86/xen/enlighten_hvm.c             |    3 +-
 arch/x86/xen/enlighten_pv.c              |    1 +
 arch/x86/xen/enlighten_pvh.c             |    3 +-
 arch/x86/xen/xen-asm_32.S                |    2 +-
 arch/x86/xen/xen-asm_64.S                |    2 +-
 arch/x86/xen/xen-head.S                  |    8 +-
 drivers/block/xen-blkback/blkback.c      |   27 +-
 drivers/block/xen-blkback/common.h       |    2 +
 drivers/block/xen-blkback/xenbus.c       |   19 +-
 drivers/net/xen-netback/netback.c        |   25 +-
 drivers/xen/balloon.c                    |   15 +-
 drivers/xen/events/events_2l.c           |    4 +-
 drivers/xen/events/events_base.c         |    6 +-
 drivers/xen/events/events_fifo.c         |    2 +-
 drivers/xen/evtchn.c                     |    4 +-
 drivers/xen/features.c                   |    1 +
 drivers/xen/gntdev.c                     |   10 +-
 drivers/xen/grant-table.c                |   15 +-
 drivers/xen/privcmd.c                    |    5 +-
 drivers/xen/xenbus/xenbus.h              |    2 +
 drivers/xen/xenbus/xenbus_client.c       |   28 +-
 drivers/xen/xenbus/xenbus_dev_backend.c  |    4 +-
 drivers/xen/xenbus/xenbus_dev_frontend.c |    6 +-
 drivers/xen/xenbus/xenbus_probe.c        |   57 +-
 drivers/xen/xenbus/xenbus_xs.c           |   40 +-
 drivers/xen/xenfs/super.c                |    7 +-
 drivers/xen/xenfs/xensyms.c              |    4 +
 include/linux/kvm_host.h                 |   24 +
 include/uapi/linux/kvm.h                 |  104 +-
 include/xen/balloon.h                    |    7 +
 include/xen/xen-ops.h                    |    2 +-
 include/xen/xen.h                        |    5 +
 include/xen/xenbus.h                     |    3 +
 48 files changed, 3232 insertions(+), 67 deletions(-)
 create mode 100644 arch/x86/kvm/xen-asm.S
 create mode 100644 arch/x86/kvm/xen-shim.c
 create mode 100644 arch/x86/kvm/xen.c
 create mode 100644 arch/x86/kvm/xen.h

Comments

Paolo Bonzini Feb. 20, 2019, 9:09 p.m. UTC | #1
On 20/02/19 21:15, Joao Martins wrote:
>  2. PV Driver support (patches 17 - 39)
> 
>  We start by redirecting hypercalls from the backend to routines
>  which emulate the behaviour that PV backends expect i.e. grant
>  table and interdomain events. Next, we add support for late
>  initialization of xenbus, followed by implementing
>  frontend/backend communication mechanisms (i.e. grant tables and
>  interdomain event channels). Finally, introduce xen-shim.ko,
>  which will setup a limited Xen environment. This uses the added
>  functionality of Xen specific shared memory (grant tables) and
>  notifications (event channels).

I am a bit worried by the last patches, they seem really brittle and
prone to breakage.  I don't know Xen well enough to understand if the
lack of support for GNTMAP_host_map is fixable, but if not, you have to
define a completely different hypercall.

Of course, tests are missing.  You should use the
tools/testing/selftests/kvm/ framework, and ideally each patch should
come with coverage for the newly-added code.

Thanks,

Paolo
Marek Marczykowski-Górecki Feb. 20, 2019, 11:39 p.m. UTC | #2
On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>  2. PV Driver support (patches 17 - 39)
> 
>  We start by redirecting hypercalls from the backend to routines
>  which emulate the behaviour that PV backends expect i.e. grant
>  table and interdomain events. Next, we add support for late
>  initialization of xenbus, followed by implementing
>  frontend/backend communication mechanisms (i.e. grant tables and
>  interdomain event channels). Finally, introduce xen-shim.ko,
>  which will setup a limited Xen environment. This uses the added
>  functionality of Xen specific shared memory (grant tables) and
>  notifications (event channels).

Does it mean backends could be run in another guest, similarly as on
real Xen? AFAIK virtio doesn't allow that as virtio backends need
arbitrary write access to guest memory. But grant tables provide enough
abstraction to do that safely.
Ankur Arora Feb. 21, 2019, 12:29 a.m. UTC | #3
On 2/20/19 1:09 PM, Paolo Bonzini wrote:
> On 20/02/19 21:15, Joao Martins wrote:
>>   2. PV Driver support (patches 17 - 39)
>>
>>   We start by redirecting hypercalls from the backend to routines
>>   which emulate the behaviour that PV backends expect i.e. grant
>>   table and interdomain events. Next, we add support for late
>>   initialization of xenbus, followed by implementing
>>   frontend/backend communication mechanisms (i.e. grant tables and
>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>   which will setup a limited Xen environment. This uses the added
>>   functionality of Xen specific shared memory (grant tables) and
>>   notifications (event channels).
> 
> I am a bit worried by the last patches, they seem really brittle and
> prone to breakage.  I don't know Xen well enough to understand if the
> lack of support for GNTMAP_host_map is fixable, but if not, you have to
> define a completely different hypercall.
I assume you are aware of most of this, but just in case, here's the
flow when a backend driver wants to map a grant-reference in the
host: it allocates an empty struct page (via ballooning) and does a
map_grant_ref(GNTMAP_host_map) hypercall. In response, Xen validates the
grant-reference and maps it onto the address associated with the struct
page.
After this, from the POV of the underlying network/block drivers, these
struct pages can be used as just regular pages.

To support this in a KVM environment, where AFAICS no remapping of pages
is possible, the idea was to make minimal changes to the backend drivers
such that map_grant_ref() could just return the PFN from which the
backend could derive the struct page.

To ensure that backends -- when running in this environment -- have been
modified to deal with these new semantics, our map_grant_ref()
implementation explicitly disallows the GNTMAP_host_map flag.
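
In code, the check would be roughly of this shape (a simplified sketch, not
the actual patch; shim_resolve_gref() is a hypothetical helper):

	static int shim_map_grant_ref(struct gnttab_map_grant_ref *op)
	{
		if (op->flags & GNTMAP_host_map) {
			/* Unmodified backend expecting a real mapping. */
			op->status = GNTST_general_error;
			return 0;
		}

		/* Adapted backend: just resolve the gref to a frame. */
		return shim_resolve_gref(op);
	}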

Now if I'm reading you right, you would prefer something more
straightforward -- perhaps similar semantics but a new flag that
makes this behaviour explicit?

> 
> Of course, tests are missing.  You should use the
> tools/testing/selftests/kvm/ framework, and ideally each patch should
> come with coverage for the newly-added code.
Agreed.

Thanks
Ankur

> 
> Thanks,
> 
> Paolo
>
Ankur Arora Feb. 21, 2019, 12:31 a.m. UTC | #4
On 2/20/19 3:39 PM, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>>   2. PV Driver support (patches 17 - 39)
>>
>>   We start by redirecting hypercalls from the backend to routines
>>   which emulate the behaviour that PV backends expect i.e. grant
>>   table and interdomain events. Next, we add support for late
>>   initialization of xenbus, followed by implementing
>>   frontend/backend communication mechanisms (i.e. grant tables and
>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>   which will setup a limited Xen environment. This uses the added
>>   functionality of Xen specific shared memory (grant tables) and
>>   notifications (event channels).
> 
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
I'm afraid not. For now grant operations (map/unmap) can only be done
by backends to the local KVM instance.

Ankur

> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.

>
Jürgen Groß Feb. 21, 2019, 7:57 a.m. UTC | #5
On 21/02/2019 00:39, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>>  2. PV Driver support (patches 17 - 39)
>>
>>  We start by redirecting hypercalls from the backend to routines
>>  which emulate the behaviour that PV backends expect i.e. grant
>>  table and interdomain events. Next, we add support for late
>>  initialization of xenbus, followed by implementing
>>  frontend/backend communication mechanisms (i.e. grant tables and
>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>  which will setup a limited Xen environment. This uses the added
>>  functionality of Xen specific shared memory (grant tables) and
>>  notifications (event channels).
> 
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.

As long as the grant table emulation in xen-shim isn't just a wrapper to
"normal" KVM guest memory access.

I guess the xen-shim implementation doesn't support the same kind of
guest memory isolation as Xen does?


Juergen
Joao Martins Feb. 21, 2019, 11:45 a.m. UTC | #6
On 2/20/19 9:09 PM, Paolo Bonzini wrote:
> On 20/02/19 21:15, Joao Martins wrote:
>>  2. PV Driver support (patches 17 - 39)
>>
>>  We start by redirecting hypercalls from the backend to routines
>>  which emulate the behaviour that PV backends expect i.e. grant
>>  table and interdomain events. Next, we add support for late
>>  initialization of xenbus, followed by implementing
>>  frontend/backend communication mechanisms (i.e. grant tables and
>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>  which will setup a limited Xen environment. This uses the added
>>  functionality of Xen specific shared memory (grant tables) and
>>  notifications (event channels).
> 
> I am a bit worried by the last patches, they seem really brittle and
> prone to breakage.  I don't know Xen well enough to understand if the
> lack of support for GNTMAP_host_map is fixable, but if not, you have to
> define a completely different hypercall.
> 
I guess Ankur already answered this; so just to stack this on top of his comment.

The xen_shim_domain() is only meant to handle the case where the backend
has/can-have full access to guest memory [i.e. netback and blkback would work
with similar assumptions as vhost?]. For the normal case, where a backend *in a
guest* maps and unmaps other guest memory, this is not applicable and these
changes don't affect that case.

IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
actual hypercalls but rather invoking shim_hypercall(). The call chain would go
more or less like:

<netback|blkback|scsiback>
 gnttab_map_refs(map_ops, pages)
   HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
     shim_hypercall()
       shim_hcall_gntmap()

Our reasoning was that given we are already in KVM, why mapping a page if the
user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
the shim determines its user doesn't want to map the page. Also, there's another
issue where PV backends always need a struct page to reference the device
inflight data as Ankur pointed out.

> Of course, tests are missing.

FWIW: this was deliberate, as we wanted to get folks' impressions before
proceeding further with the work.

> You should use the
> tools/testing/selftests/kvm/ framework, and ideally each patch should
> come with coverage for the newly-added code.

Got it.

Cheers,
	Joao
Joao Martins Feb. 21, 2019, 11:55 a.m. UTC | #7
On 2/20/19 11:39 PM, Marek Marczykowski-Górecki wrote:
> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>>  2. PV Driver support (patches 17 - 39)
>>
>>  We start by redirecting hypercalls from the backend to routines
>>  which emulate the behaviour that PV backends expect i.e. grant
>>  table and interdomain events. Next, we add support for late
>>  initialization of xenbus, followed by implementing
>>  frontend/backend communication mechanisms (i.e. grant tables and
>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>  which will setup a limited Xen environment. This uses the added
>>  functionality of Xen specific shared memory (grant tables) and
>>  notifications (event channels).
> 
> Does it mean backends could be run in another guest, similarly as on
> real Xen? AFAIK virtio doesn't allow that as virtio backends need
> arbitrary write access to guest memory. But grant tables provide enough
> abstraction to do that safely.
> 
In this series not yet. Here we are trying to resemble how KVM drives its kernel
PV backends (i.e. vhost virtio).

The domU grant {un,}mapping could be added as a complement. The main difference
between domU gntmap and shim gntmap is that the former maps/unmaps the guest
page on the backend. Most of it is common code, I think.

	Joao
Joao Martins Feb. 21, 2019, noon UTC | #8
On 2/21/19 7:57 AM, Juergen Gross wrote:
> On 21/02/2019 00:39, Marek Marczykowski-Górecki wrote:
>> On Wed, Feb 20, 2019 at 08:15:30PM +0000, Joao Martins wrote:
>>>  2. PV Driver support (patches 17 - 39)
>>>
>>>  We start by redirecting hypercalls from the backend to routines
>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>  table and interdomain events. Next, we add support for late
>>>  initialization of xenbus, followed by implementing
>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>  which will setup a limited Xen environment. This uses the added
>>>  functionality of Xen specific shared memory (grant tables) and
>>>  notifications (event channels).
>>
>> Does it mean backends could be run in another guest, similarly as on
>> real Xen? AFAIK virtio doesn't allow that as virtio backends need
>> arbitrary write access to guest memory. But grant tables provide enough
>> abstraction to do that safely.
> 
> As long as the grant table emulation in xen-shim isn't just a wrapper to
> "normal" KVM guest memory access.
> 
> I guess the xen-shim implementation doesn't support the same kind of
> guest memory isolation as Xen does?
> 
It doesn't, but it's also two different usecases.

The xen-shim is meant for when the PV backend lives in the hypervisor (a similar
model to KVM vhost), whereas the domU grant mapping that Marek is asking about
requires additional hypercalls handled by the guest (i.e. in kvm_xen_hypercall).
This would equate to how Xen currently performs grant mapping/unmapping.

	Joao
Paolo Bonzini Feb. 22, 2019, 4:59 p.m. UTC | #9
On 21/02/19 12:45, Joao Martins wrote:
> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>> On 20/02/19 21:15, Joao Martins wrote:
>>>  2. PV Driver support (patches 17 - 39)
>>>
>>>  We start by redirecting hypercalls from the backend to routines
>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>  table and interdomain events. Next, we add support for late
>>>  initialization of xenbus, followed by implementing
>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>  which will setup a limited Xen environment. This uses the added
>>>  functionality of Xen specific shared memory (grant tables) and
>>>  notifications (event channels).
>>
>> I am a bit worried by the last patches, they seem really brittle and
>> prone to breakage.  I don't know Xen well enough to understand if the
>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>> define a completely different hypercall.
>>
> I guess Ankur already answered this; so just to stack this on top of his comment.
> 
> The xen_shim_domain() is only meant to handle the case where the backend
> has/can-have full access to guest memory [i.e. netback and blkback would work
> with similar assumptions as vhost?]. For the normal case, where a backend *in a
> guest* maps and unmaps other guest memory, this is not applicable and these
> changes don't affect that case.
> 
> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
> more or less like:
> 
> <netback|blkback|scsiback>
>  gnttab_map_refs(map_ops, pages)
>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>      shim_hypercall()
>        shim_hcall_gntmap()
> 
> Our reasoning was that given we are already in KVM, why mapping a page if the
> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
> the shim determines its user doesn't want to map the page. Also, there's another
> issue where PV backends always need a struct page to reference the device
> inflight data as Ankur pointed out.

Ultimately it's up to the Xen people.  It does make their API uglier,
especially the in/out change for the parameter.  If you can at least
avoid that, it would alleviate my concerns quite a bit.

Paolo
Joao Martins March 12, 2019, 5:14 p.m. UTC | #10
On 2/22/19 4:59 PM, Paolo Bonzini wrote:
> On 21/02/19 12:45, Joao Martins wrote:
>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>  2. PV Driver support (patches 17 - 39)
>>>>
>>>>  We start by redirecting hypercalls from the backend to routines
>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>  table and interdomain events. Next, we add support for late
>>>>  initialization of xenbus, followed by implementing
>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>  which will setup a limited Xen environment. This uses the added
>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>  notifications (event channels).
>>>
>>> I am a bit worried by the last patches, they seem really brittle and
>>> prone to breakage.  I don't know Xen well enough to understand if the
>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>> define a completely different hypercall.
>>>
>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>
>> The xen_shim_domain() is only meant to handle the case where the backend
>> has/can-have full access to guest memory [i.e. netback and blkback would work
>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>> guest* maps and unmaps other guest memory, this is not applicable and these
>> changes don't affect that case.
>>
>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>> more or less like:
>>
>> <netback|blkback|scsiback>
>>  gnttab_map_refs(map_ops, pages)
>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>      shim_hypercall()
>>        shim_hcall_gntmap()
>>
>> Our reasoning was that given we are already in KVM, why mapping a page if the
>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>> the shim determines its user doesn't want to map the page. Also, there's another
>> issue where PV backends always need a struct page to reference the device
>> inflight data as Ankur pointed out.
> 
> Ultimately it's up to the Xen people.  It does make their API uglier,
> especially the in/out change for the parameter.  If you can at least
> avoid that, it would alleviate my concerns quite a bit.

In my view, we have two options overall:

1) Make it explicit, the changes the PV drivers we have to make in
order to support xen_shim_domain(). This could mean e.g. a) add a callback
argument to gnttab_map_refs() that is invoked for every page that gets looked up
successfully, and inside this callback the PV driver may update it's tracking
page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
shim_domain specific bits would be a little more abstracted from Xen PV
backends. See netback example below the scissors mark. Or b) have sort of a
translate_gref() and put_gref() API that Xen PV drivers use which make it even
more explicit that there's no grant ops involved. The latter is more invasive.

2) The second option is to support guest grant mapping/unmapping [*] to allow
hosting PV backends inside the guest. This would remove the Xen changes in this
series completely. But it would require another guest being used
as netback/blkback/xenstored, and less performance than 1) (though, in theory,
it would be equivalent to what does Xen with grants/events). The only changes in
Linux Xen code is adding xenstored domain support, but that is useful on its own
outside the scope of this work.

I think there's value on both; 1) is probably more familiar for KVM users
perhaps (as it is similar to what vhost does?) while  2) equates to implementing
Xen disagregation capabilities in KVM.

Thoughts? Xen maintainers what's your take on this?

	Joao

[*] Interdomain events would also have to change.

---------------- >8 ----------------

It isn't much cleaner, but PV drivers avoid/hide a bunch of xen_shim_domain()
conditionals in the data path. It is more explicit while avoiding the in/out
parameter change in gnttab_map_refs.

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index 936c0b3e0ba2..c6e47dcb7e10 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -158,6 +158,7 @@ struct xenvif_queue { /* Per-queue data for xenvif */
 	struct gnttab_copy tx_copy_ops[MAX_PENDING_REQS];
 	struct gnttab_map_grant_ref tx_map_ops[MAX_PENDING_REQS];
 	struct gnttab_unmap_grant_ref tx_unmap_ops[MAX_PENDING_REQS];
+	struct gnttab_page_changed page_cb[MAX_PENDING_REQS];
 	/* passed to gnttab_[un]map_refs with pages under (un)mapping */
 	struct page *pages_to_map[MAX_PENDING_REQS];
 	struct page *pages_to_unmap[MAX_PENDING_REQS];
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 80aae3a32c2a..56788d8cd813 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -324,15 +324,29 @@ struct xenvif_tx_cb {

 #define XENVIF_TX_CB(skb) ((struct xenvif_tx_cb *)(skb)->cb)

+static inline void xenvif_tx_page_changed(phys_addr_t addr, void *opaque)
+{
+	struct page **page = opaque;
+
+	*page = virt_to_page(addr);
+}
 static inline void xenvif_tx_create_map_op(struct xenvif_queue *queue,
 					   u16 pending_idx,
 					   struct xen_netif_tx_request *txp,
 					   unsigned int extra_count,
 					   struct gnttab_map_grant_ref *mop)
 {
-	queue->pages_to_map[mop-queue->tx_map_ops] = queue->mmap_pages[pending_idx];
+	u32 map_idx = mop - queue->tx_map_ops;
+
+	queue->pages_to_map[map_idx] = queue->mmap_pages[pending_idx];
+	queue->page_cb[map_idx].ctx = &queue->mmap_pages[pending_idx];
+	queue->page_cb[map_idx].cb = xenvif_tx_page_changed;
+
 	gnttab_set_map_op(mop, idx_to_kaddr(queue, pending_idx),
-			  GNTMAP_host_map | GNTMAP_readonly,
+			  GNTTAB_host_map | GNTMAP_readonly,
 			  txp->gref, queue->vif->domid);

 	memcpy(&queue->pending_tx_info[pending_idx].req, txp,
@@ -1268,7 +1283,7 @@ static inline void xenvif_tx_dealloc_action(struct xenvif_queue *queue)
 				queue->mmap_pages[pending_idx];
 			gnttab_set_unmap_op(gop,
 					    idx_to_kaddr(queue, pending_idx),
-					    GNTMAP_host_map,
+					    GNTTAB_host_map,
 					    queue->grant_tx_handle[pending_idx]);
 			xenvif_grant_handle_reset(queue, pending_idx);
 			++gop;
@@ -1322,7 +1337,7 @@ int xenvif_tx_action(struct xenvif_queue *queue, int budget)
 	gnttab_batch_copy(queue->tx_copy_ops, nr_cops);
 	if (nr_mops != 0) {
 		ret = gnttab_map_refs(queue->tx_map_ops,
-				      NULL,
+				      NULL, queue->page_cb,
 				      queue->pages_to_map,
 				      nr_mops);
 		BUG_ON(ret);
@@ -1394,7 +1409,7 @@ void xenvif_idx_unmap(struct xenvif_queue *queue, u16 pending_idx)

 	gnttab_set_unmap_op(&tx_unmap_op,
 			    idx_to_kaddr(queue, pending_idx),
-			    GNTMAP_host_map,
+			    GNTTAB_host_map,
 			    queue->grant_tx_handle[pending_idx]);
 	xenvif_grant_handle_reset(queue, pending_idx);

@@ -1622,7 +1637,7 @@ static int __init netback_init(void)
 {
 	int rc = 0;

-	if (!xen_domain())
+	if (!xen_domain() && !xen_shim_domain_get())
 		return -ENODEV;

 	/* Allow as many queues as there are CPUs but max. 8 if user has not
@@ -1663,6 +1678,7 @@ static void __exit netback_fini(void)
 	debugfs_remove_recursive(xen_netback_dbg_root);
 #endif /* CONFIG_DEBUG_FS */
 	xenvif_xenbus_fini();
+	xen_shim_domain_put();
 }
 module_exit(netback_fini);

diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
index 7ea6fb6a2e5d..b4c9d7ff531f 100644
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -1031,6 +1031,7 @@ void gnttab_foreach_grant(struct page **pages,

 int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_grant_ref *kmap_ops,
+		    struct gnttab_page_changed *page_cb,
 		    struct page **pages, unsigned int count)
 {
 	int i, ret;
@@ -1045,6 +1046,12 @@ int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		{
 			struct xen_page_foreign *foreign;

+			if (xen_shim_domain() && page_cb) {
+				page_cb[i].cb(map_ops[i].host_addr,
+					      page_cb[i].ctx);
+				continue;
+			}
+
 			SetPageForeign(pages[i]);
 			foreign = xen_page_foreign(pages[i]);
 			foreign->domid = map_ops[i].dom;
diff --git a/include/xen/grant_table.h b/include/xen/grant_table.h
index 9bc5bc07d4d3..5e17fa08e779 100644
--- a/include/xen/grant_table.h
+++ b/include/xen/grant_table.h
@@ -55,6 +55,9 @@
 /* NR_GRANT_FRAMES must be less than or equal to that configured in Xen */
 #define NR_GRANT_FRAMES 4

+/* Selects host map only if on native Xen */
+#define GNTTAB_host_map (xen_shim_domain() ? 0 : GNTMAP_host_map)
+
 struct gnttab_free_callback {
 	struct gnttab_free_callback *next;
 	void (*fn)(void *);
@@ -78,6 +81,12 @@ struct gntab_unmap_queue_data
 	unsigned int age;
 };

+struct gnttab_page_changed
+{
+	void (*cb)(phys_addr_t addr, void *opaque);
+	void *ctx;
+};
+
 int gnttab_init(void);
 int gnttab_suspend(void);
 int gnttab_resume(void);
@@ -221,6 +230,7 @@ void gnttab_pages_clear_private(int nr_pages, struct page **pages);

 int gnttab_map_refs(struct gnttab_map_grant_ref *map_ops,
 		    struct gnttab_map_grant_ref *kmap_ops,
+		    struct gnttab_page_changed *cb,
 		    struct page **pages, unsigned int count);
 int gnttab_unmap_refs(struct gnttab_unmap_grant_ref *unmap_ops,
 		      struct gnttab_unmap_grant_ref *kunmap_ops,
Jürgen Groß April 8, 2019, 6:44 a.m. UTC | #11
On 12/03/2019 18:14, Joao Martins wrote:
> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>> On 21/02/19 12:45, Joao Martins wrote:
>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>  2. PV Driver support (patches 17 - 39)
>>>>>
>>>>>  We start by redirecting hypercalls from the backend to routines
>>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>>  table and interdomain events. Next, we add support for late
>>>>>  initialization of xenbus, followed by implementing
>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>  which will setup a limited Xen environment. This uses the added
>>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>>  notifications (event channels).
>>>>
>>>> I am a bit worried by the last patches, they seem really brittle and
>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>> define a completely different hypercall.
>>>>
>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>
>>> The xen_shim_domain() is only meant to handle the case where the backend
>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>> changes don't affect that case.
>>>
>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>> more or less like:
>>>
>>> <netback|blkback|scsiback>
>>>  gnttab_map_refs(map_ops, pages)
>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>      shim_hypercall()
>>>        shim_hcall_gntmap()
>>>
>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>> the shim determines its user doesn't want to map the page. Also, there's another
>>> issue where PV backends always need a struct page to reference the device
>>> inflight data as Ankur pointed out.
>>
>> Ultimately it's up to the Xen people.  It does make their API uglier,
>> especially the in/out change for the parameter.  If you can at least
>> avoid that, it would alleviate my concerns quite a bit.
> 
> In my view, we have two options overall:
> 
> 1) Make it explicit, the changes the PV drivers we have to make in
> order to support xen_shim_domain(). This could mean e.g. a) add a callback
> argument to gnttab_map_refs() that is invoked for every page that gets looked up
> successfully, and inside this callback the PV driver may update it's tracking
> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
> shim_domain specific bits would be a little more abstracted from Xen PV
> backends. See netback example below the scissors mark. Or b) have sort of a
> translate_gref() and put_gref() API that Xen PV drivers use which make it even
> more explicit that there's no grant ops involved. The latter is more invasive.
> 
> 2) The second option is to support guest grant mapping/unmapping [*] to allow
> hosting PV backends inside the guest. This would remove the Xen changes in this
> series completely. But it would require another guest being used
> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
> it would be equivalent to what does Xen with grants/events). The only changes in
> Linux Xen code is adding xenstored domain support, but that is useful on its own
> outside the scope of this work.
> 
> I think there's value on both; 1) is probably more familiar for KVM users
> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
> Xen disagregation capabilities in KVM.
> 
> Thoughts? Xen maintainers what's your take on this?

What I'd like best would be a new handle (e.g. xenhost_t *) used as an
abstraction layer for this kind of stuff. It should be passed to the
backends and those would pass it on to low-level Xen drivers (xenbus,
event channels, grant table, ...).

I was planning to do that (the xenhost_t * stuff) soon in order to add
support for nested Xen using PV devices (you need two Xenstores for that
as the nested dom0 is acting as Xen backend server, while using PV
frontends for accessing the "real" world outside).

The xenhost_t should be used for:

- accessing Xenstore
- issuing and receiving events
- doing hypercalls
- grant table operations

So exactly the kind of stuff you want to do, too.
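
Very roughly, I'm thinking of something along these lines (just a sketch to
illustrate the idea, names are not final):

struct xenhost_ops {
	/* Xenstore access */
	int (*xs_read)(struct xenhost *xh, const char *path,
		       void **value, unsigned int *len);
	/* event channels */
	int (*evtchn_send)(struct xenhost *xh, evtchn_port_t port);
	/* hypercalls */
	long (*hypercall)(struct xenhost *xh, unsigned int op, void *arg);
	/* grant table operations */
	int (*grant_map)(struct xenhost *xh,
			 struct gnttab_map_grant_ref *ops,
			 struct page **pages, unsigned int count);
};

typedef struct xenhost {
	const struct xenhost_ops *ops;	/* Xen, nested Xen or xen-shim */
	void *priv;
} xenhost_t;

Backends would get a xenhost_t * at probe time and pass it down to the
low-level drivers instead of relying on the current implicit globals.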


Juergen
Joao Martins April 8, 2019, 10:36 a.m. UTC | #12
On 4/8/19 7:44 AM, Juergen Gross wrote:
> On 12/03/2019 18:14, Joao Martins wrote:
>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>> On 21/02/19 12:45, Joao Martins wrote:
>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>  2. PV Driver support (patches 17 - 39)
>>>>>>
>>>>>>  We start by redirecting hypercalls from the backend to routines
>>>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>>>  table and interdomain events. Next, we add support for late
>>>>>>  initialization of xenbus, followed by implementing
>>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>  which will setup a limited Xen environment. This uses the added
>>>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>>>  notifications (event channels).
>>>>>
>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>> define a completely different hypercall.
>>>>>
>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>
>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>> changes don't affect that case.
>>>>
>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>> more or less like:
>>>>
>>>> <netback|blkback|scsiback>
>>>>  gnttab_map_refs(map_ops, pages)
>>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>      shim_hypercall()
>>>>        shim_hcall_gntmap()
>>>>
>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>> issue where PV backends always need a struct page to reference the device
>>>> inflight data as Ankur pointed out.
>>>
>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>> especially the in/out change for the parameter.  If you can at least
>>> avoid that, it would alleviate my concerns quite a bit.
>>
>> In my view, we have two options overall:
>>
>> 1) Make it explicit, the changes the PV drivers we have to make in
>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>> successfully, and inside this callback the PV driver may update it's tracking
>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>> shim_domain specific bits would be a little more abstracted from Xen PV
>> backends. See netback example below the scissors mark. Or b) have sort of a
>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>> more explicit that there's no grant ops involved. The latter is more invasive.
>>
>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>> hosting PV backends inside the guest. This would remove the Xen changes in this
>> series completely. But it would require another guest being used
>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>> it would be equivalent to what does Xen with grants/events). The only changes in
>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>> outside the scope of this work.
>>
>> I think there's value on both; 1) is probably more familiar for KVM users
>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>> Xen disagregation capabilities in KVM.
>>
>> Thoughts? Xen maintainers what's your take on this?
> 
> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
> abstraction layer for this kind of stuff. It should be passed to the
> backends and those would pass it on to low-level Xen drivers (xenbus,
> event channels, grant table, ...).
> 
So if IIRC backends would use the xenhost layer to access grants or frames
referenced by grants, and that would grok into some of this. IOW, you would have
two implementors of xenhost: one for nested remote/local events+grants and
another for this "shim domain" ?

> I was planning to do that (the xenhost_t * stuff) soon in order to add
> support for nested Xen using PV devices (you need two Xenstores for that
> as the nested dom0 is acting as Xen backend server, while using PV
> frontends for accessing the "real" world outside).
> 
> The xenhost_t should be used for:
> 
> - accessing Xenstore
> - issuing and receiving events
> - doing hypercalls
> - grant table operations
> 

In the text above, I sort of suggested a slice of this in 1.b) with a
translate_gref() and put_gref() API -- to get the page from a gref. This was
because of the flags|host_addr hurdle we described above wrt using grant
maps/unmaps. Do you think some of the xenhost layer would be amenable to
supporting this case?
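
For clarity, what I had in mind for 1.b) was something along these lines
(hypothetical API, just to illustrate; not implemented in this series):

	/* Resolve a grant reference to the backing page, and release it. */
	struct page *xen_shim_translate_gref(domid_t domid, grant_ref_t gref);
	void xen_shim_put_gref(domid_t domid, grant_ref_t gref);

so a PV backend would fetch the page behind a request directly instead of
going through gnttab_map_refs()/gnttab_unmap_refs() with map flags it
doesn't really need.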

> So exactly the kind of stuff you want to do, too.
> 
Cool idea!

	Joao
Jürgen Groß April 8, 2019, 10:42 a.m. UTC | #13
On 08/04/2019 12:36, Joao Martins wrote:
> On 4/8/19 7:44 AM, Juergen Gross wrote:
>> On 12/03/2019 18:14, Joao Martins wrote:
>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>  2. PV Driver support (patches 17 - 39)
>>>>>>>
>>>>>>>  We start by redirecting hypercalls from the backend to routines
>>>>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>  table and interdomain events. Next, we add support for late
>>>>>>>  initialization of xenbus, followed by implementing
>>>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>  which will setup a limited Xen environment. This uses the added
>>>>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>>>>  notifications (event channels).
>>>>>>
>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>> define a completely different hypercall.
>>>>>>
>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>
>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>> changes don't affect that case.
>>>>>
>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>> more or less like:
>>>>>
>>>>> <netback|blkback|scsiback>
>>>>>  gnttab_map_refs(map_ops, pages)
>>>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>      shim_hypercall()
>>>>>        shim_hcall_gntmap()
>>>>>
>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>> issue where PV backends always need a struct page to reference the device
>>>>> inflight data as Ankur pointed out.
>>>>
>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>>> especially the in/out change for the parameter.  If you can at least
>>>> avoid that, it would alleviate my concerns quite a bit.
>>>
>>> In my view, we have two options overall:
>>>
>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>> successfully, and inside this callback the PV driver may update it's tracking
>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>
>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>> series completely. But it would require another guest being used
>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>> outside the scope of this work.
>>>
>>> I think there's value on both; 1) is probably more familiar for KVM users
>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>>> Xen disagregation capabilities in KVM.
>>>
>>> Thoughts? Xen maintainers what's your take on this?
>>
>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>> abstraction layer for this kind of stuff. It should be passed to the
>> backends and those would pass it on to low-level Xen drivers (xenbus,
>> event channels, grant table, ...).
>>
> So if IIRC backends would use the xenhost layer to access grants or frames
> referenced by grants, and that would grok into some of this. IOW, you would have
> two implementors of xenhost: one for nested remote/local events+grants and
> another for this "shim domain" ?

As I'd need that for nested Xen I guess that would make it 3 variants.
Probably the xen-shim variant would need more hooks, but that should be
no problem.

>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>> support for nested Xen using PV devices (you need two Xenstores for that
>> as the nested dom0 is acting as Xen backend server, while using PV
>> frontends for accessing the "real" world outside).
>>
>> The xenhost_t should be used for:
>>
>> - accessing Xenstore
>> - issuing and receiving events
>> - doing hypercalls
>> - grant table operations
>>
> 
> In the text above, I sort of suggested a slice of this on 1.b) with a
> translate_gref() and put_gref() API -- to get the page from a gref. This was
> because of the flags|host_addr hurdle we depicted above wrt to using using grant
> maps/unmaps. You think some of the xenhost layer would be ammenable to support
> this case?

I think so, yes.

> 
>> So exactly the kind of stuff you want to do, too.
>>
> Cool idea!

In the end you might make my life easier for nested Xen. :-)

Do you want to have a try with that idea or should I do that? I might be
able to start working on that in about a month.


Juergen
Joao Martins April 8, 2019, 5:31 p.m. UTC | #14
On 4/8/19 11:42 AM, Juergen Gross wrote:
> On 08/04/2019 12:36, Joao Martins wrote:
>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>  2. PV Driver support (patches 17 - 39)
>>>>>>>>
>>>>>>>>  We start by redirecting hypercalls from the backend to routines
>>>>>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>  table and interdomain events. Next, we add support for late
>>>>>>>>  initialization of xenbus, followed by implementing
>>>>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>  which will setup a limited Xen environment. This uses the added
>>>>>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>>>>>  notifications (event channels).
>>>>>>>
>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>> define a completely different hypercall.
>>>>>>>
>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>
>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>> changes don't affect that case.
>>>>>>
>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>> more or less like:
>>>>>>
>>>>>> <netback|blkback|scsiback>
>>>>>>  gnttab_map_refs(map_ops, pages)
>>>>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>      shim_hypercall()
>>>>>>        shim_hcall_gntmap()
>>>>>>
>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>> inflight data as Ankur pointed out.
>>>>>
>>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>>>> especially the in/out change for the parameter.  If you can at least
>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>
>>>> In my view, we have two options overall:
>>>>
>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>
>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>> series completely. But it would require another guest being used
>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>> outside the scope of this work.
>>>>
>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>>>> Xen disagregation capabilities in KVM.
>>>>
>>>> Thoughts? Xen maintainers what's your take on this?
>>>
>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>> abstraction layer for this kind of stuff. It should be passed to the
>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>> event channels, grant table, ...).
>>>
>> So if IIRC backends would use the xenhost layer to access grants or frames
>> referenced by grants, and that would grok into some of this. IOW, you would have
>> two implementors of xenhost: one for nested remote/local events+grants and
>> another for this "shim domain" ?
> 
> As I'd need that for nested Xen I guess that would make it 3 variants.
> Probably the xen-shim variant would need more hooks, but that should be
> no problem.
> 
I probably messed up in the short description but "nested remote/local
events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
So maybe only 2 variants are needed?

>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>> support for nested Xen using PV devices (you need two Xenstores for that
>>> as the nested dom0 is acting as Xen backend server, while using PV
>>> frontends for accessing the "real" world outside).
>>>
>>> The xenhost_t should be used for:
>>>
>>> - accessing Xenstore
>>> - issuing and receiving events
>>> - doing hypercalls
>>> - grant table operations
>>>
>>
>> In the text above, I sort of suggested a slice of this on 1.b) with a
>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>> this case?
> 
> I think so, yes.
> 
>>
>>> So exactly the kind of stuff you want to do, too.
>>>
>> Cool idea!
> 
> In the end you might make my life easier for nested Xen. :-)
> 
Hehe :)

> Do you want to have a try with that idea or should I do that? I might be
> able to start working on that in about a month.
> 
Ankur (CC'ed) will give it a shot, and should start a new thread on this
xenhost abstraction layer.

	Joao
Stefano Stabellini April 9, 2019, 12:35 a.m. UTC | #15
On Mon, 8 Apr 2019, Joao Martins wrote:
> On 4/8/19 11:42 AM, Juergen Gross wrote:
> > On 08/04/2019 12:36, Joao Martins wrote:
> >> On 4/8/19 7:44 AM, Juergen Gross wrote:
> >>> On 12/03/2019 18:14, Joao Martins wrote:
> >>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
> >>>>> On 21/02/19 12:45, Joao Martins wrote:
> >>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
> >>>>>>> On 20/02/19 21:15, Joao Martins wrote:
> >>>>>>>>  2. PV Driver support (patches 17 - 39)
> >>>>>>>>
> >>>>>>>>  We start by redirecting hypercalls from the backend to routines
> >>>>>>>>  which emulate the behaviour that PV backends expect i.e. grant
> >>>>>>>>  table and interdomain events. Next, we add support for late
> >>>>>>>>  initialization of xenbus, followed by implementing
> >>>>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
> >>>>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
> >>>>>>>>  which will setup a limited Xen environment. This uses the added
> >>>>>>>>  functionality of Xen specific shared memory (grant tables) and
> >>>>>>>>  notifications (event channels).
> >>>>>>>
> >>>>>>> I am a bit worried by the last patches, they seem really brittle and
> >>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
> >>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
> >>>>>>> define a completely different hypercall.
> >>>>>>>
> >>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
> >>>>>>
> >>>>>> The xen_shim_domain() is only meant to handle the case where the backend
> >>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
> >>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
> >>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
> >>>>>> changes don't affect that case.
> >>>>>>
> >>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
> >>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
> >>>>>> more or less like:
> >>>>>>
> >>>>>> <netback|blkback|scsiback>
> >>>>>>  gnttab_map_refs(map_ops, pages)
> >>>>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
> >>>>>>      shim_hypercall()
> >>>>>>        shim_hcall_gntmap()
> >>>>>>
> >>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
> >>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
> >>>>>> the shim determines its user doesn't want to map the page. Also, there's another
> >>>>>> issue where PV backends always need a struct page to reference the device
> >>>>>> inflight data as Ankur pointed out.
> >>>>>
> >>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
> >>>>> especially the in/out change for the parameter.  If you can at least
> >>>>> avoid that, it would alleviate my concerns quite a bit.
> >>>>
> >>>> In my view, we have two options overall:
> >>>>
> >>>> 1) Make it explicit, the changes the PV drivers we have to make in
> >>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
> >>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
> >>>> successfully, and inside this callback the PV driver may update it's tracking
> >>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
> >>>> shim_domain specific bits would be a little more abstracted from Xen PV
> >>>> backends. See netback example below the scissors mark. Or b) have sort of a
> >>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
> >>>> more explicit that there's no grant ops involved. The latter is more invasive.
> >>>>
> >>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
> >>>> hosting PV backends inside the guest. This would remove the Xen changes in this
> >>>> series completely. But it would require another guest being used
> >>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
> >>>> it would be equivalent to what does Xen with grants/events). The only changes in
> >>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
> >>>> outside the scope of this work.
> >>>>
> >>>> I think there's value on both; 1) is probably more familiar for KVM users
> >>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
> >>>> Xen disagregation capabilities in KVM.
> >>>>
> >>>> Thoughts? Xen maintainers what's your take on this?
> >>>
> >>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
> >>> abstraction layer for this kind of stuff. It should be passed to the
> >>> backends and those would pass it on to low-level Xen drivers (xenbus,
> >>> event channels, grant table, ...).
> >>>
> >> So if IIRC backends would use the xenhost layer to access grants or frames
> >> referenced by grants, and that would grok into some of this. IOW, you would have
> >> two implementors of xenhost: one for nested remote/local events+grants and
> >> another for this "shim domain" ?
> > 
> > As I'd need that for nested Xen I guess that would make it 3 variants.
> > Probably the xen-shim variant would need more hooks, but that should be
> > no problem.
> > 
> I probably messed up in the short description but "nested remote/local
> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
> So maybe only 2 variants are needed?
> 
> >>> I was planning to do that (the xenhost_t * stuff) soon in order to add
> >>> support for nested Xen using PV devices (you need two Xenstores for that
> >>> as the nested dom0 is acting as Xen backend server, while using PV
> >>> frontends for accessing the "real" world outside).
> >>>
> >>> The xenhost_t should be used for:
> >>>
> >>> - accessing Xenstore
> >>> - issuing and receiving events
> >>> - doing hypercalls
> >>> - grant table operations
> >>>
> >>
> >> In the text above, I sort of suggested a slice of this on 1.b) with a
> >> translate_gref() and put_gref() API -- to get the page from a gref. This was
> >> because of the flags|host_addr hurdle we depicted above wrt to using using grant
> >> maps/unmaps. You think some of the xenhost layer would be ammenable to support
> >> this case?
> > 
> > I think so, yes.
> > 
> >>
> >>> So exactly the kind of stuff you want to do, too.
> >>>
> >> Cool idea!
> > 
> > In the end you might make my life easier for nested Xen. :-)
> > 
> Hehe :)
> 
> > Do you want to have a try with that idea or should I do that? I might be
> > able to start working on that in about a month.
> > 
> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
> xenhost abstraction layer.

If you are up for it, it would be great to write some documentation too.
We are starting to have decent docs for some PV protocols, describing a
specific PV interface, but we are lacking docs about the basic building
blocks to bring up PV drivers in general. They would be extremely
useful. Given that you have just done the work, you are in a great
position to write those docs. Even bad English would be fine, I am sure
somebody else could volunteer to clean up the language. Anything would
help :-)
Jürgen Groß April 9, 2019, 5:04 a.m. UTC | #16
On 08/04/2019 19:31, Joao Martins wrote:
> On 4/8/19 11:42 AM, Juergen Gross wrote:
>> On 08/04/2019 12:36, Joao Martins wrote:
>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>  2. PV Driver support (patches 17 - 39)
>>>>>>>>>
>>>>>>>>>  We start by redirecting hypercalls from the backend to routines
>>>>>>>>>  which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>  table and interdomain events. Next, we add support for late
>>>>>>>>>  initialization of xenbus, followed by implementing
>>>>>>>>>  frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>>  interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>  which will setup a limited Xen environment. This uses the added
>>>>>>>>>  functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>  notifications (event channels).
>>>>>>>>
>>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>>> define a completely different hypercall.
>>>>>>>>
>>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>>
>>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>>> changes don't affect that case.
>>>>>>>
>>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>>> more or less like:
>>>>>>>
>>>>>>> <netback|blkback|scsiback>
>>>>>>>  gnttab_map_refs(map_ops, pages)
>>>>>>>    HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>      shim_hypercall()
>>>>>>>        shim_hcall_gntmap()
>>>>>>>
>>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>>> inflight data as Ankur pointed out.
>>>>>>
>>>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>>>>> especially the in/out change for the parameter.  If you can at least
>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>
>>>>> In my view, we have two options overall:
>>>>>
>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>>
>>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>>> series completely. But it would require another guest being used
>>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>>> outside the scope of this work.
>>>>>
>>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>>>>> Xen disagregation capabilities in KVM.
>>>>>
>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>
>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>> event channels, grant table, ...).
>>>>
>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>> two implementors of xenhost: one for nested remote/local events+grants and
>>> another for this "shim domain" ?
>>
>> As I'd need that for nested Xen I guess that would make it 3 variants.
>> Probably the xen-shim variant would need more hooks, but that should be
>> no problem.
>>
> I probably messed up in the short description but "nested remote/local
> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
> So maybe only 2 variants are needed?

I need one xenhost variant for the "normal" case as today: talking to
the single hypervisor (or, in the nested case, to the L1 hypervisor).

Then I need a variant for the nested case, talking to the L0 hypervisor.

And you need a variant talking to xen-shim.

The first two variants can be active in the same system in the case of
nested Xen: the backends of L2 dom0 talk to the L1 hypervisor,
while its frontends talk to the L0 hypervisor.
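
For concreteness, a minimal sketch of how such variant handles might be
declared -- xenhost_t and all names below are assumptions at this point,
nothing like this exists in the tree yet:

/* Hypothetical sketch only; these names are assumptions, not existing code. */
typedef struct xenhost xenhost_t;

/* Variant 1: talks to the hypervisor we run on (L1 when nested). */
extern xenhost_t xenhost_default;
/* Variant 2: talks to the L0 hypervisor from a nested L2 dom0. */
extern xenhost_t xenhost_remote;
/* Variant 3: talks to xen-shim in the KVM case of this series. */
extern xenhost_t xenhost_shim;

/*
 * In nested Xen the first two coexist: the backends of L2 dom0 would be
 * handed &xenhost_default, while its frontends would use &xenhost_remote.
 */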

> 
>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>> frontends for accessing the "real" world outside).
>>>>
>>>> The xenhost_t should be used for:
>>>>
>>>> - accessing Xenstore
>>>> - issuing and receiving events
>>>> - doing hypercalls
>>>> - grant table operations
>>>>
>>>
>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>> this case?
>>
>> I think so, yes.
>>
>>>
>>>> So exactly the kind of stuff you want to do, too.
>>>>
>>> Cool idea!
>>
>> In the end you might make my life easier for nested Xen. :-)
>>
> Hehe :)
> 
>> Do you want to have a try with that idea or should I do that? I might be
>> able to start working on that in about a month.
>>
> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
> xenhost abstraction layer.

Great, looking forward to it!


Juergen
Ankur Arora April 10, 2019, 5:50 a.m. UTC | #17
On 2019-04-08 5:35 p.m., Stefano Stabellini wrote:
> On Mon, 8 Apr 2019, Joao Martins wrote:
>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>>   2. PV Driver support (patches 17 - 39)
>>>>>>>>>>
>>>>>>>>>>   We start by redirecting hypercalls from the backend to routines
>>>>>>>>>>   which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>>   table and interdomain events. Next, we add support for late
>>>>>>>>>>   initialization of xenbus, followed by implementing
>>>>>>>>>>   frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>>   which will setup a limited Xen environment. This uses the added
>>>>>>>>>>   functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>>   notifications (event channels).
>>>>>>>>>
>>>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>>>> define a completely different hypercall.
>>>>>>>>>
>>>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>>>
>>>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>>>> changes don't affect that case.
>>>>>>>>
>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>>>> more or less like:
>>>>>>>>
>>>>>>>> <netback|blkback|scsiback>
>>>>>>>>   gnttab_map_refs(map_ops, pages)
>>>>>>>>     HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>>       shim_hypercall()
>>>>>>>>         shim_hcall_gntmap()
>>>>>>>>
>>>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>
>>>>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>>>>>> especially the in/out change for the parameter.  If you can at least
>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>
>>>>>> In my view, we have two options overall:
>>>>>>
>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>>>
>>>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>>>> series completely. But it would require another guest being used
>>>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>>>> outside the scope of this work.
>>>>>>
>>>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>>>>>> Xen disagregation capabilities in KVM.
>>>>>>
>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>
>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>> event channels, grant table, ...).
>>>>>
>>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>>> two implementors of xenhost: one for nested remote/local events+grants and
>>>> another for this "shim domain" ?
>>>
>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>> Probably the xen-shim variant would need more hooks, but that should be
>>> no problem.
>>>
>> I probably messed up in the short description but "nested remote/local
>> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
>> So maybe only 2 variants are needed?
>>
>>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>>> frontends for accessing the "real" world outside).
>>>>>
>>>>> The xenhost_t should be used for:
>>>>>
>>>>> - accessing Xenstore
>>>>> - issuing and receiving events
>>>>> - doing hypercalls
>>>>> - grant table operations
>>>>>
>>>>
>>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>>> this case?
>>>
>>> I think so, yes.
>>>
>>>>
>>>>> So exactly the kind of stuff you want to do, too.
>>>>>
>>>> Cool idea!
>>>
>>> In the end you might make my life easier for nested Xen. :-)
>>>
>> Hehe :)
>>
>>> Do you want to have a try with that idea or should I do that? I might be
>>> able to start working on that in about a month.
>>>
>> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
>> xenhost abstraction layer.
> 
> If you are up for it, it would be great to write some documentation too.
> We are starting to have decent docs for some PV protocols, describing a
> specific PV interface, but we are lacking docs about the basic building
> blocks to bring up PV drivers in general. They would be extremely
Agreed. These would be useful.

> useful. Given that you have just done the work, you are in a great
> position to write those docs. Even bad English would be fine, I am sure
> somebody else could volunteer to clean-up the language. Anything would
> help :-)
Can't make any promises on this yet, but I will see if I can convert the
notes I made into something useful for the community.


Ankur

Ankur Arora April 10, 2019, 6:55 a.m. UTC | #18
On 2019-04-08 10:04 p.m., Juergen Gross wrote:
> On 08/04/2019 19:31, Joao Martins wrote:
>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>>   2. PV Driver support (patches 17 - 39)
>>>>>>>>>>
>>>>>>>>>>   We start by redirecting hypercalls from the backend to routines
>>>>>>>>>>   which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>>   table and interdomain events. Next, we add support for late
>>>>>>>>>>   initialization of xenbus, followed by implementing
>>>>>>>>>>   frontend/backend communication mechanisms (i.e. grant tables and
>>>>>>>>>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>>   which will setup a limited Xen environment. This uses the added
>>>>>>>>>>   functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>>   notifications (event channels).
>>>>>>>>>
>>>>>>>>> I am a bit worried by the last patches, they seem really brittle and
>>>>>>>>> prone to breakage.  I don't know Xen well enough to understand if the
>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not, you have to
>>>>>>>>> define a completely different hypercall.
>>>>>>>>>
>>>>>>>> I guess Ankur already answered this; so just to stack this on top of his comment.
>>>>>>>>
>>>>>>>> The xen_shim_domain() is only meant to handle the case where the backend
>>>>>>>> has/can-have full access to guest memory [i.e. netback and blkback would work
>>>>>>>> with similar assumptions as vhost?]. For the normal case, where a backend *in a
>>>>>>>> guest* maps and unmaps other guest memory, this is not applicable and these
>>>>>>>> changes don't affect that case.
>>>>>>>>
>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the hypercalls aren't
>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The call chain would go
>>>>>>>> more or less like:
>>>>>>>>
>>>>>>>> <netback|blkback|scsiback>
>>>>>>>>   gnttab_map_refs(map_ops, pages)
>>>>>>>>     HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>>       shim_hypercall()
>>>>>>>>         shim_hcall_gntmap()
>>>>>>>>
>>>>>>>> Our reasoning was that given we are already in KVM, why mapping a page if the
>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of GNTMAP_host_map is how
>>>>>>>> the shim determines its user doesn't want to map the page. Also, there's another
>>>>>>>> issue where PV backends always need a struct page to reference the device
>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>
>>>>>>> Ultimately it's up to the Xen people.  It does make their API uglier,
>>>>>>> especially the in/out change for the parameter.  If you can at least
>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>
>>>>>> In my view, we have two options overall:
>>>>>>
>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a callback
>>>>>> argument to gnttab_map_refs() that is invoked for every page that gets looked up
>>>>>> successfully, and inside this callback the PV driver may update it's tracking
>>>>>> page. Here we no longer have this in/out parameter in gnttab_map_refs, and all
>>>>>> shim_domain specific bits would be a little more abstracted from Xen PV
>>>>>> backends. See netback example below the scissors mark. Or b) have sort of a
>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which make it even
>>>>>> more explicit that there's no grant ops involved. The latter is more invasive.
>>>>>>
>>>>>> 2) The second option is to support guest grant mapping/unmapping [*] to allow
>>>>>> hosting PV backends inside the guest. This would remove the Xen changes in this
>>>>>> series completely. But it would require another guest being used
>>>>>> as netback/blkback/xenstored, and less performance than 1) (though, in theory,
>>>>>> it would be equivalent to what does Xen with grants/events). The only changes in
>>>>>> Linux Xen code is adding xenstored domain support, but that is useful on its own
>>>>>> outside the scope of this work.
>>>>>>
>>>>>> I think there's value on both; 1) is probably more familiar for KVM users
>>>>>> perhaps (as it is similar to what vhost does?) while  2) equates to implementing
>>>>>> Xen disagregation capabilities in KVM.
>>>>>>
>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>
>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used as an
>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>> event channels, grant table, ...).
>>>>>
>>>> So if IIRC backends would use the xenhost layer to access grants or frames
>>>> referenced by grants, and that would grok into some of this. IOW, you would have
>>>> two implementors of xenhost: one for nested remote/local events+grants and
>>>> another for this "shim domain" ?
>>>
>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>> Probably the xen-shim variant would need more hooks, but that should be
>>> no problem.
>>>
>> I probably messed up in the short description but "nested remote/local
>> events+grants" was referring to nested Xen (FWIW remote meant L0 and local L1).
>> So maybe only 2 variants are needed?
> 
> I need one xenhost variant for the "normal" case as today: talking to
> the single hypervisor (or in nested case: to the L1 hypervisor).
> 
> Then I need a variant for the nested case talking to L0 hypervisor.
> 
> And you need a variant talking to xen-shim.
> 
> The first two variants can be active in the same system in case of
> nested Xen: the backends of L2 dom0 are talking to L1 hypervisor,
> while its frontends are talking with L0 hypervisor.
Thanks, this is clarifying.

So, essentially, backend drivers with a xenhost_t handle communicate
with the low-level Xen drivers etc. using that same handle; however, if they
communicate with frontend drivers for accessing the "real" world,
they exclusively use standard mechanisms (Linux or hypercalls)?

In this scenario, L2 dom0 xen-netback and L2 dom0 xen-netfront should
just be able to use Linux interfaces. But if the L2 dom0 xenbus-backend
needs to talk to the L2 dom0 xenbus-frontend, do you see them layered,
or are they still exclusively talking via the standard mechanisms?

Ankur

> 
>>
>>>>> I was planning to do that (the xenhost_t * stuff) soon in order to add
>>>>> support for nested Xen using PV devices (you need two Xenstores for that
>>>>> as the nested dom0 is acting as Xen backend server, while using PV
>>>>> frontends for accessing the "real" world outside).
>>>>>
>>>>> The xenhost_t should be used for:
>>>>>
>>>>> - accessing Xenstore
>>>>> - issuing and receiving events
>>>>> - doing hypercalls
>>>>> - grant table operations
>>>>>
>>>>
>>>> In the text above, I sort of suggested a slice of this on 1.b) with a
>>>> translate_gref() and put_gref() API -- to get the page from a gref. This was
>>>> because of the flags|host_addr hurdle we depicted above wrt to using using grant
>>>> maps/unmaps. You think some of the xenhost layer would be ammenable to support
>>>> this case?
>>>
>>> I think so, yes.
>>>
>>>>
>>>>> So exactly the kind of stuff you want to do, too.
>>>>>
>>>> Cool idea!
>>>
>>> In the end you might make my life easier for nested Xen. :-)
>>>
>> Hehe :)
>>
>>> Do you want to have a try with that idea or should I do that? I might be
>>> able to start working on that in about a month.
>>>
>> Ankur (CC'ed) will give a shot at it, and should start a new thread on this
>> xenhost abstraction layer.
> 
> Great, looking forward to it!
> 
> 
> Juergen
>
Jürgen Groß April 10, 2019, 7:14 a.m. UTC | #19
On 10/04/2019 08:55, Ankur Arora wrote:
> On 2019-04-08 10:04 p.m., Juergen Gross wrote:
>> On 08/04/2019 19:31, Joao Martins wrote:
>>> On 4/8/19 11:42 AM, Juergen Gross wrote:
>>>> On 08/04/2019 12:36, Joao Martins wrote:
>>>>> On 4/8/19 7:44 AM, Juergen Gross wrote:
>>>>>> On 12/03/2019 18:14, Joao Martins wrote:
>>>>>>> On 2/22/19 4:59 PM, Paolo Bonzini wrote:
>>>>>>>> On 21/02/19 12:45, Joao Martins wrote:
>>>>>>>>> On 2/20/19 9:09 PM, Paolo Bonzini wrote:
>>>>>>>>>> On 20/02/19 21:15, Joao Martins wrote:
>>>>>>>>>>>   2. PV Driver support (patches 17 - 39)
>>>>>>>>>>>
>>>>>>>>>>>   We start by redirecting hypercalls from the backend to
>>>>>>>>>>> routines
>>>>>>>>>>>   which emulate the behaviour that PV backends expect i.e. grant
>>>>>>>>>>>   table and interdomain events. Next, we add support for late
>>>>>>>>>>>   initialization of xenbus, followed by implementing
>>>>>>>>>>>   frontend/backend communication mechanisms (i.e. grant
>>>>>>>>>>> tables and
>>>>>>>>>>>   interdomain event channels). Finally, introduce xen-shim.ko,
>>>>>>>>>>>   which will setup a limited Xen environment. This uses the
>>>>>>>>>>> added
>>>>>>>>>>>   functionality of Xen specific shared memory (grant tables) and
>>>>>>>>>>>   notifications (event channels).
>>>>>>>>>>
>>>>>>>>>> I am a bit worried by the last patches, they seem really
>>>>>>>>>> brittle and
>>>>>>>>>> prone to breakage.  I don't know Xen well enough to understand
>>>>>>>>>> if the
>>>>>>>>>> lack of support for GNTMAP_host_map is fixable, but if not,
>>>>>>>>>> you have to
>>>>>>>>>> define a completely different hypercall.
>>>>>>>>>>
>>>>>>>>> I guess Ankur already answered this; so just to stack this on
>>>>>>>>> top of his comment.
>>>>>>>>>
>>>>>>>>> The xen_shim_domain() is only meant to handle the case where
>>>>>>>>> the backend
>>>>>>>>> has/can-have full access to guest memory [i.e. netback and
>>>>>>>>> blkback would work
>>>>>>>>> with similar assumptions as vhost?]. For the normal case, where
>>>>>>>>> a backend *in a
>>>>>>>>> guest* maps and unmaps other guest memory, this is not
>>>>>>>>> applicable and these
>>>>>>>>> changes don't affect that case.
>>>>>>>>>
>>>>>>>>> IOW, the PV backend here sits on the hypervisor, and the
>>>>>>>>> hypercalls aren't
>>>>>>>>> actual hypercalls but rather invoking shim_hypercall(). The
>>>>>>>>> call chain would go
>>>>>>>>> more or less like:
>>>>>>>>>
>>>>>>>>> <netback|blkback|scsiback>
>>>>>>>>>   gnttab_map_refs(map_ops, pages)
>>>>>>>>>     HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
>>>>>>>>>       shim_hypercall()
>>>>>>>>>         shim_hcall_gntmap()
>>>>>>>>>
>>>>>>>>> Our reasoning was that given we are already in KVM, why mapping
>>>>>>>>> a page if the
>>>>>>>>> user (i.e. the kernel PV backend) is himself? The lack of
>>>>>>>>> GNTMAP_host_map is how
>>>>>>>>> the shim determines its user doesn't want to map the page.
>>>>>>>>> Also, there's another
>>>>>>>>> issue where PV backends always need a struct page to reference
>>>>>>>>> the device
>>>>>>>>> inflight data as Ankur pointed out.
>>>>>>>>
>>>>>>>> Ultimately it's up to the Xen people.  It does make their API
>>>>>>>> uglier,
>>>>>>>> especially the in/out change for the parameter.  If you can at
>>>>>>>> least
>>>>>>>> avoid that, it would alleviate my concerns quite a bit.
>>>>>>>
>>>>>>> In my view, we have two options overall:
>>>>>>>
>>>>>>> 1) Make it explicit, the changes the PV drivers we have to make in
>>>>>>> order to support xen_shim_domain(). This could mean e.g. a) add a
>>>>>>> callback
>>>>>>> argument to gnttab_map_refs() that is invoked for every page that
>>>>>>> gets looked up
>>>>>>> successfully, and inside this callback the PV driver may update
>>>>>>> it's tracking
>>>>>>> page. Here we no longer have this in/out parameter in
>>>>>>> gnttab_map_refs, and all
>>>>>>> shim_domain specific bits would be a little more abstracted from
>>>>>>> Xen PV
>>>>>>> backends. See netback example below the scissors mark. Or b) have
>>>>>>> sort of a
>>>>>>> translate_gref() and put_gref() API that Xen PV drivers use which
>>>>>>> make it even
>>>>>>> more explicit that there's no grant ops involved. The latter is
>>>>>>> more invasive.
>>>>>>>
>>>>>>> 2) The second option is to support guest grant mapping/unmapping
>>>>>>> [*] to allow
>>>>>>> hosting PV backends inside the guest. This would remove the Xen
>>>>>>> changes in this
>>>>>>> series completely. But it would require another guest being used
>>>>>>> as netback/blkback/xenstored, and less performance than 1)
>>>>>>> (though, in theory,
>>>>>>> it would be equivalent to what does Xen with grants/events). The
>>>>>>> only changes in
>>>>>>> Linux Xen code is adding xenstored domain support, but that is
>>>>>>> useful on its own
>>>>>>> outside the scope of this work.
>>>>>>>
>>>>>>> I think there's value on both; 1) is probably more familiar for
>>>>>>> KVM users
>>>>>>> perhaps (as it is similar to what vhost does?) while  2) equates
>>>>>>> to implementing
>>>>>>> Xen disagregation capabilities in KVM.
>>>>>>>
>>>>>>> Thoughts? Xen maintainers what's your take on this?
>>>>>>
>>>>>> What I'd like best would be a new handle (e.g. xenhost_t *) used
>>>>>> as an
>>>>>> abstraction layer for this kind of stuff. It should be passed to the
>>>>>> backends and those would pass it on to low-level Xen drivers (xenbus,
>>>>>> event channels, grant table, ...).
>>>>>>
>>>>> So if IIRC backends would use the xenhost layer to access grants or
>>>>> frames
>>>>> referenced by grants, and that would grok into some of this. IOW,
>>>>> you would have
>>>>> two implementors of xenhost: one for nested remote/local
>>>>> events+grants and
>>>>> another for this "shim domain" ?
>>>>
>>>> As I'd need that for nested Xen I guess that would make it 3 variants.
>>>> Probably the xen-shim variant would need more hooks, but that should be
>>>> no problem.
>>>>
>>> I probably messed up in the short description but "nested remote/local
>>> events+grants" was referring to nested Xen (FWIW remote meant L0 and
>>> local L1).
>>> So maybe only 2 variants are needed?
>>
>> I need one xenhost variant for the "normal" case as today: talking to
>> the single hypervisor (or in nested case: to the L1 hypervisor).
>>
>> Then I need a variant for the nested case talking to L0 hypervisor.
>>
>> And you need a variant talking to xen-shim.
>>
>> The first two variants can be active in the same system in case of
>> nested Xen: the backends of L2 dom0 are talking to L1 hypervisor,
>> while its frontends are talking with L0 hypervisor.
> Thanks this is clarifying.
> 
> So, essentially backend drivers with a xenhost_t handle, communicate
> with Xen low-level drivers etc using the same handle, however, if they
> communicate with frontend drivers for accessing the "real" world,
> they exclusively use standard mechanisms (Linux or hypercalls)?

This should be opaque to the backends. The xenhost_t handle should have
a pointer to a function vector for the relevant grant-, event- and Xenstore-
related functions. Calls to such functions should be done via an inline
function that takes the xenhost_t handle as one parameter; that function
will then call the correct implementation.
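
A minimal sketch of what such a function vector plus inline wrappers
could look like -- all names below are assumptions, none of this exists
yet, and the real set of hooks would follow the list above:

struct gnttab_map_grant_ref;	/* from xen/interface/grant_table.h */
struct page;

typedef struct xenhost xenhost_t;

struct xenhost_ops {
	long (*hypercall)(xenhost_t *xh, unsigned int op, void *arg);
	int (*grant_map)(xenhost_t *xh, struct gnttab_map_grant_ref *refs,
			 struct page **pages, unsigned int count);
	int (*evtchn_send)(xenhost_t *xh, unsigned int port);
	int (*xs_read)(xenhost_t *xh, const char *path, void *buf,
		       unsigned int len);
};

struct xenhost {
	const struct xenhost_ops *ops;	/* set by each variant */
};

/* Backends only call the wrappers; the handle picks whether the regular
 * Xen, nested-L0 or xen-shim implementation runs. */
static inline int xenhost_grant_map(xenhost_t *xh,
				    struct gnttab_map_grant_ref *refs,
				    struct page **pages, unsigned int count)
{
	return xh->ops->grant_map(xh, refs, pages, count);
}

static inline int xenhost_evtchn_send(xenhost_t *xh, unsigned int port)
{
	return xh->ops->evtchn_send(xh, port);
}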

> In this scenario L2 dom0 xen-netback and L2 dom0 xen-netfront should
> just be able to use Linux interfaces. But if L2 dom0 xenbus-backend
> needs to talk to L2 dom0 xenbus-frontend then do you see them layered
> or are they still exclusively talking via the standard mechanisms?

The distinction is made via the function vector in xenhost_t. So the
only change needed in the backends is the introduction of xenhost_t.
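
As a usage sketch (again hypothetical, building on the names above), a
backend would simply thread the handle it was given through to the
wrappers instead of calling the global grant-table API:

/* Hypothetical backend call site: map a batch of grant refs through the
 * xenhost wrappers rather than HYPERVISOR_grant_table_op() directly. */
static int backend_map_ring_refs(xenhost_t *xh,
				 struct gnttab_map_grant_ref *map,
				 struct page **pages, unsigned int nr)
{
	int err;

	err = xenhost_grant_map(xh, map, pages, nr);
	if (err)
		pr_err("mapping %u grant refs via xenhost failed: %d\n",
		       nr, err);
	return err;
}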

Whether we want to introduce xenhost_t in frontends, too, is TBD.


Juergen
Stefano Stabellini April 10, 2019, 8:45 p.m. UTC | #20
On Tue, 9 Apr 2019, Ankur Arora wrote:
> On 2019-04-08 5:35 p.m., Stefano Stabellini wrote:
> > On Mon, 8 Apr 2019, Joao Martins wrote:
> > > On 4/8/19 11:42 AM, Juergen Gross wrote:
> > > > On 08/04/2019 12:36, Joao Martins wrote:
> > > > > On 4/8/19 7:44 AM, Juergen Gross wrote:
> > > > > > On 12/03/2019 18:14, Joao Martins wrote:
> > > > > > > On 2/22/19 4:59 PM, Paolo Bonzini wrote:
> > > > > > > > On 21/02/19 12:45, Joao Martins wrote:
> > > > > > > > > On 2/20/19 9:09 PM, Paolo Bonzini wrote:
> > > > > > > > > > On 20/02/19 21:15, Joao Martins wrote:
> > > > > > > > > > >   2. PV Driver support (patches 17 - 39)
> > > > > > > > > > > 
> > > > > > > > > > >   We start by redirecting hypercalls from the backend to
> > > > > > > > > > > routines
> > > > > > > > > > >   which emulate the behaviour that PV backends expect i.e.
> > > > > > > > > > > grant
> > > > > > > > > > >   table and interdomain events. Next, we add support for
> > > > > > > > > > > late
> > > > > > > > > > >   initialization of xenbus, followed by implementing
> > > > > > > > > > >   frontend/backend communication mechanisms (i.e. grant
> > > > > > > > > > > tables and
> > > > > > > > > > >   interdomain event channels). Finally, introduce
> > > > > > > > > > > xen-shim.ko,
> > > > > > > > > > >   which will setup a limited Xen environment. This uses
> > > > > > > > > > > the added
> > > > > > > > > > >   functionality of Xen specific shared memory (grant
> > > > > > > > > > > tables) and
> > > > > > > > > > >   notifications (event channels).
> > > > > > > > > > 
> > > > > > > > > > I am a bit worried by the last patches, they seem really
> > > > > > > > > > brittle and
> > > > > > > > > > prone to breakage.  I don't know Xen well enough to
> > > > > > > > > > understand if the
> > > > > > > > > > lack of support for GNTMAP_host_map is fixable, but if not,
> > > > > > > > > > you have to
> > > > > > > > > > define a completely different hypercall.
> > > > > > > > > > 
> > > > > > > > > I guess Ankur already answered this; so just to stack this on
> > > > > > > > > top of his comment.
> > > > > > > > > 
> > > > > > > > > The xen_shim_domain() is only meant to handle the case where
> > > > > > > > > the backend
> > > > > > > > > has/can-have full access to guest memory [i.e. netback and
> > > > > > > > > blkback would work
> > > > > > > > > with similar assumptions as vhost?]. For the normal case,
> > > > > > > > > where a backend *in a
> > > > > > > > > guest* maps and unmaps other guest memory, this is not
> > > > > > > > > applicable and these
> > > > > > > > > changes don't affect that case.
> > > > > > > > > 
> > > > > > > > > IOW, the PV backend here sits on the hypervisor, and the
> > > > > > > > > hypercalls aren't
> > > > > > > > > actual hypercalls but rather invoking shim_hypercall(). The
> > > > > > > > > call chain would go
> > > > > > > > > more or less like:
> > > > > > > > > 
> > > > > > > > > <netback|blkback|scsiback>
> > > > > > > > >   gnttab_map_refs(map_ops, pages)
> > > > > > > > >     HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref,...)
> > > > > > > > >       shim_hypercall()
> > > > > > > > >         shim_hcall_gntmap()
> > > > > > > > > 
> > > > > > > > > Our reasoning was that given we are already in KVM, why
> > > > > > > > > mapping a page if the
> > > > > > > > > user (i.e. the kernel PV backend) is himself? The lack of
> > > > > > > > > GNTMAP_host_map is how
> > > > > > > > > the shim determines its user doesn't want to map the page.
> > > > > > > > > Also, there's another
> > > > > > > > > issue where PV backends always need a struct page to reference
> > > > > > > > > the device
> > > > > > > > > inflight data as Ankur pointed out.
> > > > > > > > 
> > > > > > > > Ultimately it's up to the Xen people.  It does make their API
> > > > > > > > uglier,
> > > > > > > > especially the in/out change for the parameter.  If you can at
> > > > > > > > least
> > > > > > > > avoid that, it would alleviate my concerns quite a bit.
> > > > > > > 
> > > > > > > In my view, we have two options overall:
> > > > > > > 
> > > > > > > 1) Make it explicit, the changes the PV drivers we have to make in
> > > > > > > order to support xen_shim_domain(). This could mean e.g. a) add a
> > > > > > > callback
> > > > > > > argument to gnttab_map_refs() that is invoked for every page that
> > > > > > > gets looked up
> > > > > > > successfully, and inside this callback the PV driver may update
> > > > > > > it's tracking
> > > > > > > page. Here we no longer have this in/out parameter in
> > > > > > > gnttab_map_refs, and all
> > > > > > > shim_domain specific bits would be a little more abstracted from
> > > > > > > Xen PV
> > > > > > > backends. See netback example below the scissors mark. Or b) have
> > > > > > > sort of a
> > > > > > > translate_gref() and put_gref() API that Xen PV drivers use which
> > > > > > > make it even
> > > > > > > more explicit that there's no grant ops involved. The latter is
> > > > > > > more invasive.
> > > > > > > 
> > > > > > > 2) The second option is to support guest grant mapping/unmapping
> > > > > > > [*] to allow
> > > > > > > hosting PV backends inside the guest. This would remove the Xen
> > > > > > > changes in this
> > > > > > > series completely. But it would require another guest being used
> > > > > > > as netback/blkback/xenstored, and less performance than 1)
> > > > > > > (though, in theory,
> > > > > > > it would be equivalent to what does Xen with grants/events). The
> > > > > > > only changes in
> > > > > > > Linux Xen code is adding xenstored domain support, but that is
> > > > > > > useful on its own
> > > > > > > outside the scope of this work.
> > > > > > > 
> > > > > > > I think there's value on both; 1) is probably more familiar for
> > > > > > > KVM users
> > > > > > > perhaps (as it is similar to what vhost does?) while  2) equates
> > > > > > > to implementing
> > > > > > > Xen disagregation capabilities in KVM.
> > > > > > > 
> > > > > > > Thoughts? Xen maintainers what's your take on this?
> > > > > > 
> > > > > > What I'd like best would be a new handle (e.g. xenhost_t *) used as
> > > > > > an
> > > > > > abstraction layer for this kind of stuff. It should be passed to the
> > > > > > backends and those would pass it on to low-level Xen drivers
> > > > > > (xenbus,
> > > > > > event channels, grant table, ...).
> > > > > > 
> > > > > So if IIRC backends would use the xenhost layer to access grants or
> > > > > frames
> > > > > referenced by grants, and that would grok into some of this. IOW, you
> > > > > would have
> > > > > two implementors of xenhost: one for nested remote/local events+grants
> > > > > and
> > > > > another for this "shim domain" ?
> > > > 
> > > > As I'd need that for nested Xen I guess that would make it 3 variants.
> > > > Probably the xen-shim variant would need more hooks, but that should be
> > > > no problem.
> > > > 
> > > I probably messed up in the short description but "nested remote/local
> > > events+grants" was referring to nested Xen (FWIW remote meant L0 and local
> > > L1).
> > > So maybe only 2 variants are needed?
> > > 
> > > > > > I was planning to do that (the xenhost_t * stuff) soon in order to
> > > > > > add
> > > > > > support for nested Xen using PV devices (you need two Xenstores for
> > > > > > that
> > > > > > as the nested dom0 is acting as Xen backend server, while using PV
> > > > > > frontends for accessing the "real" world outside).
> > > > > > 
> > > > > > The xenhost_t should be used for:
> > > > > > 
> > > > > > - accessing Xenstore
> > > > > > - issuing and receiving events
> > > > > > - doing hypercalls
> > > > > > - grant table operations
> > > > > > 
> > > > > 
> > > > > In the text above, I sort of suggested a slice of this on 1.b) with a
> > > > > translate_gref() and put_gref() API -- to get the page from a gref.
> > > > > This was
> > > > > because of the flags|host_addr hurdle we depicted above wrt to using
> > > > > using grant
> > > > > maps/unmaps. You think some of the xenhost layer would be ammenable to
> > > > > support
> > > > > this case?
> > > > 
> > > > I think so, yes.
> > > > 
> > > > > 
> > > > > > So exactly the kind of stuff you want to do, too.
> > > > > > 
> > > > > Cool idea!
> > > > 
> > > > In the end you might make my life easier for nested Xen. :-)
> > > > 
> > > Hehe :)
> > > 
> > > > Do you want to have a try with that idea or should I do that? I might be
> > > > able to start working on that in about a month.
> > > > 
> > > Ankur (CC'ed) will give a shot at it, and should start a new thread on
> > > this
> > > xenhost abstraction layer.
> > 
> > If you are up for it, it would be great to write some documentation too.
> > We are starting to have decent docs for some PV protocols, describing a
> > specific PV interface, but we are lacking docs about the basic building
> > blocks to bring up PV drivers in general. They would be extremely
> Agreed. These would be useful.
> 
> > useful. Given that you have just done the work, you are in a great
> > position to write those docs. Even bad English would be fine, I am sure
> > somebody else could volunteer to clean-up the language. Anything would
> > help :-)
> Can't make any promises on this yet but I will see if I can convert
> notes I made into something useful for the community.

Thank you!