Message ID | 20180726195429.31960-23-borntraeger@de.ibm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | vfio-ap: guest dedicated crypto adapters | expand |
On Thu, 26 Jul 2018 21:54:29 +0200 Christian Borntraeger <borntraeger@de.ibm.com> wrote: ... > +The process for reserving an AP queue for use by a KVM guest is: > + > +* The vfio-ap driver during its initialization will perform the following: > + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap > + * Create the 'matrix' device in the 'vfio_ap' root > + * Register the matrix device with the device core > +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and > + newer) and to provide the vfio_ap driver's probe and remove callback > + interfaces. The reason why older devices are not supported is because there > + are no systems available on which to test. > +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be > + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain > + mask to /sys/bus/ap/aqmask. > + > + For example to reserve all the AP Queues on the system for vfio_ap driver: > + > + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask > + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask And this is a reasonable user syntax? ;) ... > + * mdev_attr_groups > + This attribute group identifies the user-defined sysfs attributes of the > + mediated device. When a device is registered with the VFIO mediated device > + framework, the sysfs attributes files identified in the 'mdev_attr_groups' > + structure will be created in the mediated matrix device's directory. The > + sysfs attributes for a mediated matrix device are: > + * assign_adapter: > + A write-only file for assigning an AP adapter to the mediated matrix > + device. To assign an adapter, the APID of the adapter is written to the > + file. > + * assign_domain: > + A write-only file for assigning an AP usage domain to the mediated matrix > + device. To assign a domain, the APQI of the AP queue corresponding to a > + usage domain is written to the file. > + * matrix: > + A read-only file for displaying the APQNs derived from the adapters and > + domains assigned to the mediated matrix device. > + * assign_control_domain: > + A write-only file for assigning an AP control domain to the mediated > + matrix device. To assign a control domain, the ID of a domain to be > + controlled is written to the file. For the initial implementation, the set > + of control domains will always include the set of usage domains, so it is > + only necessary to assign control domains that are not also assigned as > + usage domains. How will the user know when this changes? What's the transition plan to maintain compatibility when it does change? ... > +4. The administrator now needs to configure the matrixes for mediated > + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). > + > + This is how the matrix is configured for Guest1: > + > + echo 5 > assign_adapter > + echo 6 > assign_adapter > + echo 4 > assign_domain > + echo 0xab > assign_domain > + > + For this implementation, all usage domains - i.e., domains assigned > + via the assign_domain attribute file - will also be configured in the ADM > + field of the KVM guest's CRYCB, so there is no need to assign control > + domains here unless you want to assign control domains that are not > + assigned as usage domains. > + > + If a mistake is made configuring an adapter, domain or control domain, > + you can use the unassign_xxx files to unassign the adapter, domain or > + control domain. Would it be more consistent with other sysfs entries to call these "bind" and "unbind" rather than "assign" and "unassign"? > + > + To display the matrix configuration for Guest1: > + > + cat matrix > + > + This is how the matrix is configured for Guest2: > + > + echo 5 > assign_adapter > + echo 0x47 > assign_domain > + echo 0xff > assign_domain > + > +5. The adminstrator now needs to activate the mediated devices. > + echo 1 > activate Or optionally not. As in reply to cover letter, I don't think this interface is sufficiently justified, or necessarily desirable. > +6. Start Guest1: > + > + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... > + > +7. Start Guest2: > + > + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... > + > +When the guest is shut down, the mediated matrix device may be removed. > + > +Using our example again, to remove the mediated matrix device $uuid1: > + > + /sys/devices/virtual/misc > + --- [vfio_ap] > + --------- [mdev_supported_types] > + ------------ [vfio_ap-passthrough] > + --------------- [devices] > + ------------------ [$uuid1] > + --------------------- activate > + --------------------- remove > + > + > + echo 0 > activate > + echo 1 > remove > + > + This will release all the AP queues for the mediated device and > + remove all of the mdev matrix device's sysfs structures. To > + recreate and reconfigure the mdev matrix device, all of the steps starting > + with step 4 will have to be performed again. > + > + It is not necessary to remove an mdev matrix device, but one may want to > + remove it if no guest will use it during the lifetime of the linux host. If > + the mdev matrix device is removed, one may want to unbind the AP queues the > + guest was using from the vfio_ap device driver and bind them back to the > + default driver. Alternatively, the AP queues can be configured for another > + mdev matrix (i.e., guest). In either case, one must take care to change the > + secure key configured for the domain to which the queue is connected. This seems prime for data leakage, extremely sensitive data at that. Who's responsibility is it to prevent this? Shouldn't closing the device reset the device, which should perhaps wipe any key configuration? > + > + > +Limitations > +=========== > +An admin needs to be careful when writing to sysfs files > +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver > +to which an AP queue is bound to. Once an AP queue is bound vfio_ap > +driver and assigned to a mediated device, admin should not re-assign the > +AP queues back to the default driver, because of technical limitations. > +The kernel does not prevent you from re-assigning from AP queues reserved > +for the vfio_ap driver back to default driver. Future enhancements in > +the ap_bus and vfio_ap are likely to introduce complete support for such > +dynamic reconfiguration. But until then be extremely careful. Why don't we have these enhancements now? Should the whole thing be marked experimental without them? This sounds similar to a vfio-pci case where a group with multiple devices could have device which is unused by the user unbound from vfio-pci and re-bound to a native host driver. We BUG_ON when this occurs to protect the data. Probably also need to evaluate updates to Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is supposed to interact with an 'activate' attribute. Thanks, Alex
On 07/27/2018 07:44 PM, Alex Williamson wrote: > On Thu, 26 Jul 2018 21:54:29 +0200 > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > ... >> +The process for reserving an AP queue for use by a KVM guest is: >> + >> +* The vfio-ap driver during its initialization will perform the following: >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >> + * Create the 'matrix' device in the 'vfio_ap' root >> + * Register the matrix device with the device core >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >> + newer) and to provide the vfio_ap driver's probe and remove callback >> + interfaces. The reason why older devices are not supported is because there >> + are no systems available on which to test. >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >> + mask to /sys/bus/ap/aqmask. >> + >> + For example to reserve all the AP Queues on the system for vfio_ap driver: >> + >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > > And this is a reasonable user syntax? ;) > We intend to provide tooling that makes this more human friendly. BTW echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. > ... >> + * mdev_attr_groups >> + This attribute group identifies the user-defined sysfs attributes of the >> + mediated device. When a device is registered with the VFIO mediated device >> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' >> + structure will be created in the mediated matrix device's directory. The >> + sysfs attributes for a mediated matrix device are: >> + * assign_adapter: >> + A write-only file for assigning an AP adapter to the mediated matrix >> + device. To assign an adapter, the APID of the adapter is written to the >> + file. >> + * assign_domain: >> + A write-only file for assigning an AP usage domain to the mediated matrix >> + device. To assign a domain, the APQI of the AP queue corresponding to a >> + usage domain is written to the file. >> + * matrix: >> + A read-only file for displaying the APQNs derived from the adapters and >> + domains assigned to the mediated matrix device. >> + * assign_control_domain: >> + A write-only file for assigning an AP control domain to the mediated >> + matrix device. To assign a control domain, the ID of a domain to be >> + controlled is written to the file. For the initial implementation, the set >> + of control domains will always include the set of usage domains, so it is >> + only necessary to assign control domains that are not also assigned as >> + usage domains. > > > How will the user know when this changes? What's the transition plan > to maintain compatibility when it does change? > I don't think this is supposed to change under the users feet. This document probably started out as a design document. While I'm personally in favor of managing the two separate, I think we need to commit to either-or and get rid of 'For the initial implementation,'. > ... >> +4. The administrator now needs to configure the matrixes for mediated >> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). >> + >> + This is how the matrix is configured for Guest1: >> + >> + echo 5 > assign_adapter >> + echo 6 > assign_adapter >> + echo 4 > assign_domain >> + echo 0xab > assign_domain >> + >> + For this implementation, all usage domains - i.e., domains assigned >> + via the assign_domain attribute file - will also be configured in the ADM >> + field of the KVM guest's CRYCB, so there is no need to assign control >> + domains here unless you want to assign control domains that are not >> + assigned as usage domains. >> + >> + If a mistake is made configuring an adapter, domain or control domain, >> + you can use the unassign_xxx files to unassign the adapter, domain or >> + control domain. > > Would it be more consistent with other sysfs entries to call these > "bind" and "unbind" rather than "assign" and "unassign"? > I'm not sure. Let me recap what is this about. First we partition our AP resources (queues) into 'to be used by the host' and 'to be used by the guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be used by the guests' among the guests. Each vfio ap_matrix represents a partition that corresponds to a single VM (guest). Such a partition can be seen as a device that grants access (usage or administration) to a set of queues. We can grant access to queues even if these are not currently available in the system. E.g. we can partition an AP card that is currently not online or even plugged. When The resources become available (e.g. new card plugged, or just 'configured'), the resources (should) become available at all levels if the authorization is right. I know 'bind' binding a device (that currently exists within the system) to a driver. The vfio_ap stuff seems different enough to seek a different name for it. . >> + >> + To display the matrix configuration for Guest1: >> + >> + cat matrix >> + >> + This is how the matrix is configured for Guest2: >> + >> + echo 5 > assign_adapter >> + echo 0x47 > assign_domain >> + echo 0xff > assign_domain >> + >> +5. The adminstrator now needs to activate the mediated devices. >> + echo 1 > activate > > Or optionally not. As in reply to cover letter, I don't think this > interface is sufficiently justified, or necessarily desirable. > You are right. Activation is not necessary because the open will try to auto-activate it. But honestly, I would recommend activating it in advance, so the admin can be sure the resources are claimed. In our current design we choose to give the administrator the freedom, to decide on his policy how does he want to manage the vfio_ap mdevs lifecycle. And to decide what guarantees does he want to have at a given point in time. With 'activate' both 'dynamically and on-demand of starting a VM' and 'create known vfio_ap mdevs at system bring up' are viable. If there were a consensus on how the life-cycle of the mdev devices is to be managed, it would be easier to not give this freedom to the admin. >> +6. Start Guest1: >> + >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... >> + >> +7. Start Guest2: >> + >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... >> + >> +When the guest is shut down, the mediated matrix device may be removed. >> + >> +Using our example again, to remove the mediated matrix device $uuid1: >> + >> + /sys/devices/virtual/misc >> + --- [vfio_ap] >> + --------- [mdev_supported_types] >> + ------------ [vfio_ap-passthrough] >> + --------------- [devices] >> + ------------------ [$uuid1] >> + --------------------- activate >> + --------------------- remove >> + >> + >> + echo 0 > activate >> + echo 1 > remove >> + >> + This will release all the AP queues for the mediated device and >> + remove all of the mdev matrix device's sysfs structures. To >> + recreate and reconfigure the mdev matrix device, all of the steps starting >> + with step 4 will have to be performed again. >> + >> + It is not necessary to remove an mdev matrix device, but one may want to >> + remove it if no guest will use it during the lifetime of the linux host. If >> + the mdev matrix device is removed, one may want to unbind the AP queues the >> + guest was using from the vfio_ap device driver and bind them back to the >> + default driver. Alternatively, the AP queues can be configured for another >> + mdev matrix (i.e., guest). In either case, one must take care to change the >> + secure key configured for the domain to which the queue is connected. > > This seems prime for data leakage, extremely sensitive data at that. > Who's responsibility is it to prevent this? Shouldn't closing the > device reset the device, which should perhaps wipe any key > configuration? > The documentation seems to be a bit outdated here. We do reset the device in the release callback and on VFIO_DEVICE_RESET ioctl. There might be some stuff that is not supposed to be reset by us, but that is fine (master keys managed via special infrastructure called the TKE). I'm not sure what is the meaning of last sentence of the paragraph in question ('change the secure key configured ...') though. Anyway we will have to update this. >> + >> + >> +Limitations >> +=========== >> +An admin needs to be careful when writing to sysfs files >> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver >> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap >> +driver and assigned to a mediated device, admin should not re-assign the >> +AP queues back to the default driver, because of technical limitations. >> +The kernel does not prevent you from re-assigning from AP queues reserved >> +for the vfio_ap driver back to default driver. Future enhancements in >> +the ap_bus and vfio_ap are likely to introduce complete support for such >> +dynamic reconfiguration. But until then be extremely careful. > > Why don't we have these enhancements now? Should the whole thing be > marked experimental without them? This sounds similar to a vfio-pci > case where a group with multiple devices could have device which is > unused by the user unbound from vfio-pci and re-bound to a native host > driver. We BUG_ON when this occurs to protect the data. I think, marking vfio_ap as experimental is probably a good idea. I think we can do these enhancements later. Of course, having everything right away is the best, but the problem with that is that it takes (more) time. > > Probably also need to evaluate updates to > Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is > supposed to interact with an 'activate' attribute. Thanks, > I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev is necessary. I see it more like the other type specific attributes. For instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to 'control the behavior of the vGPU, such as the frame rate limiter...'. AFAIK the activate and the assign/unassign attributes are because we can not do much with homogeneous vfio_ap mdevs. The only safe default for a vfio_ap mdev is all zero masks, that can't use any resources, that is essentially useless. And this is how we construct the vfio_ap mdev devices. In that sense we could as well construct them 'activated', as empty won't result in any kind of conflicts (and allow the admin to de-activate these so we are still flexible about the lifecycle management). Regards, Halil >
On Wed, 1 Aug 2018 13:12:05 +0200 Halil Pasic <pasic@linux.ibm.com> wrote: > On 07/27/2018 07:44 PM, Alex Williamson wrote: > > On Thu, 26 Jul 2018 21:54:29 +0200 > > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > ... > >> +The process for reserving an AP queue for use by a KVM guest is: > >> + > >> +* The vfio-ap driver during its initialization will perform the following: > >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap > >> + * Create the 'matrix' device in the 'vfio_ap' root > >> + * Register the matrix device with the device core > >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and > >> + newer) and to provide the vfio_ap driver's probe and remove callback > >> + interfaces. The reason why older devices are not supported is because there > >> + are no systems available on which to test. > >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be > >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain > >> + mask to /sys/bus/ap/aqmask. > >> + > >> + For example to reserve all the AP Queues on the system for vfio_ap driver: > >> + > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > > > > And this is a reasonable user syntax? ;) > > > > We intend to provide tooling that makes this more human friendly. BTW > echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. This should be reasonably configurable even without tooling. If you want to have any APQNs reserved to the default drivers, you'll end up counting positions, and with the mask lengths here, it's like configuring zfcp, only worse. See also my comments to the patch introducing the masks.
On 08/01/2018 01:27 PM, Cornelia Huck wrote: > On Wed, 1 Aug 2018 13:12:05 +0200 > Halil Pasic <pasic@linux.ibm.com> wrote: > >> On 07/27/2018 07:44 PM, Alex Williamson wrote: >>> On Thu, 26 Jul 2018 21:54:29 +0200 >>> Christian Borntraeger <borntraeger@de.ibm.com> wrote: >>> ... >>>> +The process for reserving an AP queue for use by a KVM guest is: >>>> + >>>> +* The vfio-ap driver during its initialization will perform the following: >>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >>>> + * Create the 'matrix' device in the 'vfio_ap' root >>>> + * Register the matrix device with the device core >>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >>>> + newer) and to provide the vfio_ap driver's probe and remove callback >>>> + interfaces. The reason why older devices are not supported is because there >>>> + are no systems available on which to test. >>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >>>> + mask to /sys/bus/ap/aqmask. >>>> + >>>> + For example to reserve all the AP Queues on the system for vfio_ap driver: >>>> + >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask >>> >>> And this is a reasonable user syntax? ;) >>> >> >> We intend to provide tooling that makes this more human friendly. BTW >> echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. > > This should be reasonably configurable even without tooling. If you > want to have any APQNs reserved to the default drivers, you'll end up > counting positions, and with the mask lengths here, it's like > configuring zfcp, only worse. > > See also my comments to the patch introducing the masks. > Yes, I've seen those comments. I did not feel like meddling in. These questions are best discussed with Harald. Harald is still on vacation however. My opinion is that both approaches (bitmap and list-based whatever) have their pros and cons. Halil
On Wed, 1 Aug 2018 13:12:05 +0200 Halil Pasic <pasic@linux.ibm.com> wrote: > On 07/27/2018 07:44 PM, Alex Williamson wrote: > > On Thu, 26 Jul 2018 21:54:29 +0200 > > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > ... > >> +The process for reserving an AP queue for use by a KVM guest is: > >> + > >> +* The vfio-ap driver during its initialization will perform the following: > >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap > >> + * Create the 'matrix' device in the 'vfio_ap' root > >> + * Register the matrix device with the device core > >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and > >> + newer) and to provide the vfio_ap driver's probe and remove callback > >> + interfaces. The reason why older devices are not supported is because there > >> + are no systems available on which to test. > >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be > >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain > >> + mask to /sys/bus/ap/aqmask. > >> + > >> + For example to reserve all the AP Queues on the system for vfio_ap driver: > >> + > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > > > > And this is a reasonable user syntax? ;) > > > > We intend to provide tooling that makes this more human friendly. BTW > echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. If tooling is required, perhaps the interface is wrong by design. Maybe the interface should accept a list of bits, for example: echo 255,0 > /sys/bus/ap/aqmask vs echo 0x8000000000000000000000000000000000000000000000000000000000000001 > /sys/bus/ap/aqmask > > ... > >> + * mdev_attr_groups > >> + This attribute group identifies the user-defined sysfs attributes of the > >> + mediated device. When a device is registered with the VFIO mediated device > >> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' > >> + structure will be created in the mediated matrix device's directory. The > >> + sysfs attributes for a mediated matrix device are: > >> + * assign_adapter: > >> + A write-only file for assigning an AP adapter to the mediated matrix > >> + device. To assign an adapter, the APID of the adapter is written to the > >> + file. > >> + * assign_domain: > >> + A write-only file for assigning an AP usage domain to the mediated matrix > >> + device. To assign a domain, the APQI of the AP queue corresponding to a > >> + usage domain is written to the file. > >> + * matrix: > >> + A read-only file for displaying the APQNs derived from the adapters and > >> + domains assigned to the mediated matrix device. > >> + * assign_control_domain: > >> + A write-only file for assigning an AP control domain to the mediated > >> + matrix device. To assign a control domain, the ID of a domain to be > >> + controlled is written to the file. For the initial implementation, the set > >> + of control domains will always include the set of usage domains, so it is > >> + only necessary to assign control domains that are not also assigned as > >> + usage domains. > > > > > > How will the user know when this changes? What's the transition plan > > to maintain compatibility when it does change? > > > > I don't think this is supposed to change under the users feet. This document > probably started out as a design document. While I'm personally in favor of > managing the two separate, I think we need to commit to either-or and > get rid of 'For the initial implementation,'. +1 > > ... > >> +4. The administrator now needs to configure the matrixes for mediated > >> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). > >> + > >> + This is how the matrix is configured for Guest1: > >> + > >> + echo 5 > assign_adapter > >> + echo 6 > assign_adapter > >> + echo 4 > assign_domain > >> + echo 0xab > assign_domain > >> + > >> + For this implementation, all usage domains - i.e., domains assigned > >> + via the assign_domain attribute file - will also be configured in the ADM > >> + field of the KVM guest's CRYCB, so there is no need to assign control > >> + domains here unless you want to assign control domains that are not > >> + assigned as usage domains. > >> + > >> + If a mistake is made configuring an adapter, domain or control domain, > >> + you can use the unassign_xxx files to unassign the adapter, domain or > >> + control domain. > > > > Would it be more consistent with other sysfs entries to call these > > "bind" and "unbind" rather than "assign" and "unassign"? > > > > I'm not sure. Let me recap what is this about. First we partition our AP > resources (queues) into 'to be used by the host' and 'to be used by the > guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be > used by the guests' among the guests. Each vfio ap_matrix represents a > partition that corresponds to a single VM (guest). Such a partition can > be seen as a device that grants access (usage or administration) to a set > of queues. We can grant access to queues even if these are not currently > available in the system. E.g. we can partition an AP card that is > currently not online or even plugged. When The resources become available > (e.g. new card plugged, or just 'configured'), the resources (should) > become available at all levels if the authorization is right. I know > 'bind' binding a device (that currently exists within the system) to a > driver. The vfio_ap stuff seems different enough to seek a different name > for it. As discussed elsewhere, I'm not in favor of allowing mdev devices to exist which rely on resources that are not present or in-use elsewhere. If we remove that concept, then we are actually binding and unbinding resources to the mdev device. > >> + > >> + To display the matrix configuration for Guest1: > >> + > >> + cat matrix > >> + > >> + This is how the matrix is configured for Guest2: > >> + > >> + echo 5 > assign_adapter > >> + echo 0x47 > assign_domain > >> + echo 0xff > assign_domain > >> + > >> +5. The adminstrator now needs to activate the mediated devices. > >> + echo 1 > activate > > > > Or optionally not. As in reply to cover letter, I don't think this > > interface is sufficiently justified, or necessarily desirable. > > > > You are right. Activation is not necessary because the open will try to > auto-activate it. But honestly, I would recommend activating it in > advance, so the admin can be sure the resources are claimed. This can also be solved by claiming the resource at the point at which it is bound/assigned/attached to the mdev device (ie. at any point where we introduce new intersections in the matrix). > In our current design we choose to give the administrator the freedom, to > decide on his policy how does he want to manage the vfio_ap mdevs > lifecycle. And to decide what guarantees does he want to have at a given > point in time. With 'activate' both 'dynamically and on-demand of > starting a VM' and 'create known vfio_ap mdevs at system bring up' are > viable. If there were a consensus on how the life-cycle of the mdev > devices is to be managed, it would be easier to not give this freedom to > the admin. Please, this is not a freedom of the admin issue, this simply defines the point in time at which bound/assigned/attached resources are committed to the mdev device. In the model proposed here, the activate attribute allows resources to be over-committed, placing the burden on the management tools to implicitly understand which mdev devices can be used simultaneously. With the existing model, resources are committed on creation. I'll agree that such a model doesn't necessarily apply to these new devices, but I'd argue the next logical step is that resources are committed as they're attached such that the mdev device is always in a usable state. AFAICT, the same "freedom" is available to the admin by simply scripting mdev creation and assembly without a change to the mdev interface. > >> +6. Start Guest1: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... > >> + > >> +7. Start Guest2: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... > >> + > >> +When the guest is shut down, the mediated matrix device may be removed. > >> + > >> +Using our example again, to remove the mediated matrix device $uuid1: > >> + > >> + /sys/devices/virtual/misc > >> + --- [vfio_ap] > >> + --------- [mdev_supported_types] > >> + ------------ [vfio_ap-passthrough] > >> + --------------- [devices] > >> + ------------------ [$uuid1] > >> + --------------------- activate > >> + --------------------- remove > >> + > >> + > >> + echo 0 > activate > >> + echo 1 > remove > >> + > >> + This will release all the AP queues for the mediated device and > >> + remove all of the mdev matrix device's sysfs structures. To > >> + recreate and reconfigure the mdev matrix device, all of the steps starting > >> + with step 4 will have to be performed again. > >> + > >> + It is not necessary to remove an mdev matrix device, but one may want to > >> + remove it if no guest will use it during the lifetime of the linux host. If > >> + the mdev matrix device is removed, one may want to unbind the AP queues the > >> + guest was using from the vfio_ap device driver and bind them back to the > >> + default driver. Alternatively, the AP queues can be configured for another > >> + mdev matrix (i.e., guest). In either case, one must take care to change the > >> + secure key configured for the domain to which the queue is connected. > > > > This seems prime for data leakage, extremely sensitive data at that. > > Who's responsibility is it to prevent this? Shouldn't closing the > > device reset the device, which should perhaps wipe any key > > configuration? > > > > The documentation seems to be a bit outdated here. We do reset the device > in the release callback and on VFIO_DEVICE_RESET ioctl. There might be some > stuff that is not supposed to be reset by us, but that is fine (master keys > managed via special infrastructure called the TKE). I'm not sure what is the > meaning of last sentence of the paragraph in question ('change the secure > key configured ...') though. Anyway we will have to update this. > > >> + > >> + > >> +Limitations > >> +=========== > >> +An admin needs to be careful when writing to sysfs files > >> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver > >> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap > >> +driver and assigned to a mediated device, admin should not re-assign the > >> +AP queues back to the default driver, because of technical limitations. > >> +The kernel does not prevent you from re-assigning from AP queues reserved > >> +for the vfio_ap driver back to default driver. Future enhancements in > >> +the ap_bus and vfio_ap are likely to introduce complete support for such > >> +dynamic reconfiguration. But until then be extremely careful. > > > > Why don't we have these enhancements now? Should the whole thing be > > marked experimental without them? This sounds similar to a vfio-pci > > case where a group with multiple devices could have device which is > > unused by the user unbound from vfio-pci and re-bound to a native host > > driver. We BUG_ON when this occurs to protect the data. > > I think, marking vfio_ap as experimental is probably a good idea. I think > we can do these enhancements later. Of course, having everything right away > is the best, but the problem with that is that it takes (more) time. > > > > > Probably also need to evaluate updates to > > Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is > > supposed to interact with an 'activate' attribute. Thanks, > > > > I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev > is necessary. I see it more like the other type specific attributes. For > instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and > https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm > suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to > 'control the behavior of the vGPU, such as the frame rate limiter...'. I disagree, the difference is that the existing vendor attributes can modify the behavior of the mdev device, but are not part of the standard mdev sysfs ABI. You recommend above that 'activate' should be used prior to open(2) in order to commit mdev resources. That's a fundamental change in the management of the device, which is only obscured by the fact that open(2) does an implicit activate. Let's please just drop this whole notion of 'activate'. > AFAIK the activate and the assign/unassign attributes are because we can > not do much with homogeneous vfio_ap mdevs. The only safe default for > a vfio_ap mdev is all zero masks, that can't use any resources, that > is essentially useless. And this is how we construct the vfio_ap mdev > devices. In that sense we could as well construct them 'activated', > as empty won't result in any kind of conflicts (and allow the admin > to de-activate these so we are still flexible about the lifecycle > management). I agree that a zero mask is the only safe default for creation and that vendor specific attributes are necessary to build the configuration up to something that provides some usable value (unless we decide to expand the create interface). As above though, I disagree about what 'activate' does or does not provide. I don't see that it adds and flexibility in lifecycle management, at least not in a way that's compatible with existing mdev devices or in a way to facilitates upper level software management. It seems to only allow the admin to avoid dynamically creating and attaching resources to mdevs, at the cost of manage layers needing to account for over-commitment. Thanks, Alex
On 07/27/2018 01:44 PM, Alex Williamson wrote: > On Thu, 26 Jul 2018 21:54:29 +0200 > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > ... >> +The process for reserving an AP queue for use by a KVM guest is: >> + >> +* The vfio-ap driver during its initialization will perform the following: >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >> + * Create the 'matrix' device in the 'vfio_ap' root >> + * Register the matrix device with the device core >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >> + newer) and to provide the vfio_ap driver's probe and remove callback >> + interfaces. The reason why older devices are not supported is because there >> + are no systems available on which to test. >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >> + mask to /sys/bus/ap/aqmask. >> + >> + For example to reserve all the AP Queues on the system for vfio_ap driver: >> + >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > And this is a reasonable user syntax? ;) I am not a fan of writing a 256-bit bitmask as such. It is entirely too difficult to figure out which character to specify to indicate a bit number to be set for anything past the first few characters. Of course, tooling could be used to accomplish this task, but the sysfs interfaces should probably be user-friendly. A comma-separated list could be used; for example to set reserve adapters 21, 99 and 249: echo 21,99,249 > /sys/bus/ap/apmask The problem with this is how do we indicate that all bits are to be cleared? Do you have any suggestions for a reasonable user syntax? > > ... >> + * mdev_attr_groups >> + This attribute group identifies the user-defined sysfs attributes of the >> + mediated device. When a device is registered with the VFIO mediated device >> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' >> + structure will be created in the mediated matrix device's directory. The >> + sysfs attributes for a mediated matrix device are: >> + * assign_adapter: >> + A write-only file for assigning an AP adapter to the mediated matrix >> + device. To assign an adapter, the APID of the adapter is written to the >> + file. >> + * assign_domain: >> + A write-only file for assigning an AP usage domain to the mediated matrix >> + device. To assign a domain, the APQI of the AP queue corresponding to a >> + usage domain is written to the file. >> + * matrix: >> + A read-only file for displaying the APQNs derived from the adapters and >> + domains assigned to the mediated matrix device. >> + * assign_control_domain: >> + A write-only file for assigning an AP control domain to the mediated >> + matrix device. To assign a control domain, the ID of a domain to be >> + controlled is written to the file. For the initial implementation, the set >> + of control domains will always include the set of usage domains, so it is >> + only necessary to assign control domains that are not also assigned as >> + usage domains. > > How will the user know when this changes? What's the transition plan > to maintain compatibility when it does change? I'm sorry, I don't understand the question. When you say user, about whom are you talking; the person doing the assignments? What changes are you talking about? > > ... >> +4. The administrator now needs to configure the matrixes for mediated >> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). >> + >> + This is how the matrix is configured for Guest1: >> + >> + echo 5 > assign_adapter >> + echo 6 > assign_adapter >> + echo 4 > assign_domain >> + echo 0xab > assign_domain >> + >> + For this implementation, all usage domains - i.e., domains assigned >> + via the assign_domain attribute file - will also be configured in the ADM >> + field of the KVM guest's CRYCB, so there is no need to assign control >> + domains here unless you want to assign control domains that are not >> + assigned as usage domains. >> + >> + If a mistake is made configuring an adapter, domain or control domain, >> + you can use the unassign_xxx files to unassign the adapter, domain or >> + control domain. > Would it be more consistent with other sysfs entries to call these > "bind" and "unbind" rather than "assign" and "unassign Aren't the bind and unbind sysfs interfaces typically associated with binding devices to and unbinding devices from a device driver? In this case we are talking about configuring a mediated device's AP matrix - i.e., setting bits identifying the adapters, domains and control domains defined for the mediated device (i.e., a guest). Maybe there are better words we could use than assign/unassign, but I find the use of bind/unbind to be confusing. I'm open to suggestions. > >> + >> + To display the matrix configuration for Guest1: >> + >> + cat matrix >> + >> + This is how the matrix is configured for Guest2: >> + >> + echo 5 > assign_adapter >> + echo 0x47 > assign_domain >> + echo 0xff > assign_domain >> + >> +5. The adminstrator now needs to activate the mediated devices. >> + echo 1 > activate > Or optionally not. As in reply to cover letter, I don't think this > interface is sufficiently justified, or necessarily desirable. To clarify your objection, let me state what I think you are saying. 1. You do not think there is a good reason to assign duplicate APQNs - i.e., the cross product of all adapter IDs and domain IDs assigned to the mdev. 2. All mdevs should be validly configured for use regardless of when guests using them are started. In other words, libvirt should be able to depend on the fact that an mdev is ready for use at all times. I will discuss this with the team when we meet on Thursday and get back to you. If I am mistaken in my analysis of your concern, please clarify. I would like to mention that when a guest using an mdev starts, the mdev will be activated if not already activated. The guest will not start if another activated mdev is using an APQN assigned to the mdev of the guest being started. > >> +6. Start Guest1: >> + >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... >> + >> +7. Start Guest2: >> + >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... >> + >> +When the guest is shut down, the mediated matrix device may be removed. >> + >> +Using our example again, to remove the mediated matrix device $uuid1: >> + >> + /sys/devices/virtual/misc >> + --- [vfio_ap] >> + --------- [mdev_supported_types] >> + ------------ [vfio_ap-passthrough] >> + --------------- [devices] >> + ------------------ [$uuid1] >> + --------------------- activate >> + --------------------- remove >> + >> + >> + echo 0 > activate >> + echo 1 > remove >> + >> + This will release all the AP queues for the mediated device and >> + remove all of the mdev matrix device's sysfs structures. To >> + recreate and reconfigure the mdev matrix device, all of the steps starting >> + with step 4 will have to be performed again. >> + >> + It is not necessary to remove an mdev matrix device, but one may want to >> + remove it if no guest will use it during the lifetime of the linux host. If >> + the mdev matrix device is removed, one may want to unbind the AP queues the >> + guest was using from the vfio_ap device driver and bind them back to the >> + default driver. Alternatively, the AP queues can be configured for another >> + mdev matrix (i.e., guest). In either case, one must take care to change the >> + secure key configured for the domain to which the queue is connected. > This seems prime for data leakage, extremely sensitive data at that. > Who's responsibility is it to prevent this? Shouldn't closing the > device reset the device, which should perhaps wipe any key > configuration? The device is zeroized (reset) when it is removed. The zeroize instruction - to the best of my knowledge - does nothing with the secure key for a given queue. There is also no instruction that I know of to clear keys, so it would appear that this is left to the system administrator. > >> + >> + >> +Limitations >> +=========== >> +An admin needs to be careful when writing to sysfs files >> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver >> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap >> +driver and assigned to a mediated device, admin should not re-assign the >> +AP queues back to the default driver, because of technical limitations. >> +The kernel does not prevent you from re-assigning from AP queues reserved >> +for the vfio_ap driver back to default driver. Future enhancements in >> +the ap_bus and vfio_ap are likely to introduce complete support for such >> +dynamic reconfiguration. But until then be extremely careful. > Why don't we have these enhancements now? Should the whole thing be > marked experimental without them? This sounds similar to a vfio-pci > case where a group with multiple devices could have device which is > unused by the user unbound from vfio-pci and re-bound to a native host > driver. We BUG_ON when this occurs to protect the data. We have customers that have been waiting a long time and are very anxious to have guest access to AP devices. In the interest of reducing time to market, we have decided to simplify the initial implementation as much as possible. I wouldn't necessarily call this experimental, but view it more as the minimal viable product. As far as BUG_ON, I will discuss this with the crypto team at our meeting on Thursday and get back to you. > > Probably also need to evaluate updates to > Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is > supposed to interact with an 'activate' attribute. Thanks, As it stands, libvirt will not create or modify the attributes of an mdev. If a guest is started under libvirt and its mdev has not been activated, it will be activated when the mdev fd is opened. Of course, if another activated mdev is assigned an APQN assigned to the mdev of the guest being started, the guest will be terminated. > > Alex >
On 08/01/2018 08:08 PM, Alex Williamson wrote: > On Wed, 1 Aug 2018 13:12:05 +0200 > Halil Pasic <pasic@linux.ibm.com> wrote: > >> On 07/27/2018 07:44 PM, Alex Williamson wrote: >>> On Thu, 26 Jul 2018 21:54:29 +0200 >>> Christian Borntraeger <borntraeger@de.ibm.com> wrote: >>> ... >>>> +The process for reserving an AP queue for use by a KVM guest is: >>>> + >>>> +* The vfio-ap driver during its initialization will perform the following: >>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >>>> + * Create the 'matrix' device in the 'vfio_ap' root >>>> + * Register the matrix device with the device core >>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >>>> + newer) and to provide the vfio_ap driver's probe and remove callback >>>> + interfaces. The reason why older devices are not supported is because there >>>> + are no systems available on which to test. >>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >>>> + mask to /sys/bus/ap/aqmask. >>>> + >>>> + For example to reserve all the AP Queues on the system for vfio_ap driver: >>>> + >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask >>> >>> And this is a reasonable user syntax? ;) >>> >> >> We intend to provide tooling that makes this more human friendly. BTW >> echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. > > If tooling is required, perhaps the interface is wrong by design. > Maybe the interface should accept a list of bits, for example: > > echo 255,0 > /sys/bus/ap/aqmask > > vs > > echo 0x8000000000000000000000000000000000000000000000000000000000000001 > /sys/bus/ap/aqmask > Sorry I'm not the right person to have this discussion with. Just wanted to provide some background as someone who was there when this was discussed internally. Harald should be back next week. [..] >>> Would it be more consistent with other sysfs entries to call these >>> "bind" and "unbind" rather than "assign" and "unassign"? >>> >> >> I'm not sure. Let me recap what is this about. First we partition our AP >> resources (queues) into 'to be used by the host' and 'to be used by the >> guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be >> used by the guests' among the guests. Each vfio ap_matrix represents a >> partition that corresponds to a single VM (guest). Such a partition can >> be seen as a device that grants access (usage or administration) to a set >> of queues. We can grant access to queues even if these are not currently >> available in the system. E.g. we can partition an AP card that is >> currently not online or even plugged. When The resources become available >> (e.g. new card plugged, or just 'configured'), the resources (should) >> become available at all levels if the authorization is right. I know >> 'bind' binding a device (that currently exists within the system) to a >> driver. The vfio_ap stuff seems different enough to seek a different name >> for it. > > As discussed elsewhere, I'm not in favor of allowing mdev devices to > exist which rely on resources that are not present or in-use > elsewhere. If we remove that concept, then we are actually binding and > unbinding resources to the mdev device. > I understand and respect your point. I'm torn between the mdev perspective and the perspective given by the architecture. On s390x KVM is a second and higher level hyperviseor, and I prefer if the things are consistent among the different levels of virtualization (if possible). What I described (granting and gaining access to AP resources that are not a part of the configuration at a certain moment) is how things work at LPAR level. In use elsewhere is a taboo of course, and we check for that. Here we are completely on the same page. But I fail to see why 'not (yet) present but guaranteed to be 'ours' once it shows up (if)' has to be excluded on KVM level if it is valid on LPAR level. Can you help me understand that aspect? >>>> + >>>> + To display the matrix configuration for Guest1: >>>> + >>>> + cat matrix >>>> + >>>> + This is how the matrix is configured for Guest2: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 0x47 > assign_domain >>>> + echo 0xff > assign_domain >>>> + >>>> +5. The adminstrator now needs to activate the mediated devices. >>>> + echo 1 > activate >>> >>> Or optionally not. As in reply to cover letter, I don't think this >>> interface is sufficiently justified, or necessarily desirable. >>> >> >> You are right. Activation is not necessary because the open will try to >> auto-activate it. But honestly, I would recommend activating it in >> advance, so the admin can be sure the resources are claimed. > > This can also be solved by claiming the resource at the point at which > it is bound/assigned/attached to the mdev device (ie. at any point > where we introduce new intersections in the matrix). > See below. >> In our current design we choose to give the administrator the freedom, to >> decide on his policy how does he want to manage the vfio_ap mdevs >> lifecycle. And to decide what guarantees does he want to have at a given >> point in time. With 'activate' both 'dynamically and on-demand of >> starting a VM' and 'create known vfio_ap mdevs at system bring up' are >> viable. If there were a consensus on how the life-cycle of the mdev >> devices is to be managed, it would be easier to not give this freedom to >> the admin. > > Please, this is not a freedom of the admin issue, this simply defines > the point in time at which bound/assigned/attached resources are > committed to the mdev device. In the model proposed here, the activate > attribute allows resources to be over-committed, placing the burden on > the management tools to implicitly understand which mdev devices can be > used simultaneously. We have never seen this as the responsibility of the management software. The idea was that this would be another thing were management software does not have complete understanding. If the admin decides to do things that way activating the vfio_ap mdev was just supposed to fail. > With the existing model, resources are committed > on creation. I'll agree that such a model doesn't necessarily apply to > these new devices, but I'd argue the next logical step is that > resources are committed as they're attached such that the mdev device > is always in a usable state. Having the device in a not usable state is indeed very ugly. When we were discussing this internally I was arguing for an atomic bulk update of the matrix that either succeeds or fails. Makes things easier to think about. If we go by the acquire the diff on each assign/(bind) operation we kind of have an transaction to get in the desired state. Deadlocks may happen if no protocol is established, and similar. Please also consider we don't add one resource at a time. If we have the the adapters 0, 1 and 2 assigned, but no domains, and we assign domain 0 we 'get' not one but three queues. The 'activate' is a quite new, and in my opinion somewhat messy concept that was proposed during the workshop mentioned by Christian and introduced in this round. One of the things that is associated with 'activate' is kind of an 'edit' and 'commit/save' workflow. Especially thinking about changing the authorizations (the matrix) for a running VM (which is not supported yet), not being able to do it in one go does not sound like a good idea to me. But that would effectively mean that we would have to remain functional with the 'old' set of resources (initially empty). The we could maintain device is operational with the claimed resources an invariant. > AFAICT, the same "freedom" is available > to the admin by simply scripting mdev creation and assembly without a > change to the mdev interface. > You mean the qemu hooks in libvirt I guess. I don't quite understand. AFAIU if we use the hooks we commit to populate (and create) the vfio_ap mdevs dynamically on demand. AFAIU we can not have all three of the following at the same time 1) mutually overlapping definitions of vfio_ap devcies 2) construct and populate mdevs according to each of these definitions e.g at system bring up time. 3) Maintain all 'resources claimed' as an invariant for a vfio_ap mdev. Your proposal resolves this by compromising on: not creating and populating a vfio_ap device for each definition, but only for the ones that are actually needed. I used to argue along the same lines, but I'm not sure how acceptable is this hooks based solution to our management software people. [..] >> >> I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev >> is necessary. I see it more like the other type specific attributes. For >> instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and >> https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm >> suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to >> 'control the behavior of the vGPU, such as the frame rate limiter...'. > > I disagree, the difference is that the existing vendor attributes can > modify the behavior of the mdev device, but are not part of the > standard mdev sysfs ABI. You recommend above that 'activate' should be > used prior to open(2) in order to commit mdev resources. That's a > fundamental change in the management of the device, which is only > obscured by the fact that open(2) does an implicit activate. I agree, you have a point. Requiring to do an activate before doing an open is a heavy interference with the mdev ABI. And making the open fail is ugly indeed. > Let's please just drop this whole notion of 'activate'. > >> AFAIK the activate and the assign/unassign attributes are because we can >> not do much with homogeneous vfio_ap mdevs. The only safe default for >> a vfio_ap mdev is all zero masks, that can't use any resources, that >> is essentially useless. And this is how we construct the vfio_ap mdev >> devices. In that sense we could as well construct them 'activated', >> as empty won't result in any kind of conflicts (and allow the admin >> to de-activate these so we are still flexible about the lifecycle >> management). > > I agree that a zero mask is the only safe default for creation and that > vendor specific attributes are necessary to build the configuration up > to something that provides some usable value (unless we decide to > expand the create interface). As above though, I disagree about what > 'activate' does or does not provide. I don't see that it adds and > flexibility in lifecycle management, at least not in a way that's > compatible with existing mdev devices or in a way to facilitates upper > level software management. It seems to only allow the admin to avoid > dynamically creating and attaching resources to mdevs, at the cost of > manage layers needing to account for over-commitment. Thanks, I tried to clarify my views on some of these points (responsibility of management layers regarding over-commitment, and the flexibility in lifecycle management). The truth is, I used to argue against 'activate', and I would be very happy to forget about it if we can come up with something better. It's a bit unfortunate that I will be on a vacation (I'm already on vacation) for some time. I will try to stay tuned though. Thanks for the nice and productive discussion. We (and especially I) learned a lot about your take on the problem. Cheers, Halil
On Wed, 1 Aug 2018 19:05:35 -0400 Tony Krowiak <akrowiak@linux.ibm.com> wrote: > On 07/27/2018 01:44 PM, Alex Williamson wrote: > > On Thu, 26 Jul 2018 21:54:29 +0200 > > Christian Borntraeger <borntraeger@de.ibm.com> wrote: > > ... > >> +The process for reserving an AP queue for use by a KVM guest is: > >> + > >> +* The vfio-ap driver during its initialization will perform the following: > >> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap > >> + * Create the 'matrix' device in the 'vfio_ap' root > >> + * Register the matrix device with the device core > >> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and > >> + newer) and to provide the vfio_ap driver's probe and remove callback > >> + interfaces. The reason why older devices are not supported is because there > >> + are no systems available on which to test. > >> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be > >> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain > >> + mask to /sys/bus/ap/aqmask. > >> + > >> + For example to reserve all the AP Queues on the system for vfio_ap driver: > >> + > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask > >> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask > > And this is a reasonable user syntax? ;) > > I am not a fan of writing a 256-bit bitmask as such. It is entirely too > difficult to > figure out which character to specify to indicate a bit number to be set > for anything > past the first few characters. Of course, tooling could be used to > accomplish this > task, but the sysfs interfaces should probably be user-friendly. A > comma-separated > list could be used; for example to set reserve adapters 21, 99 and 249: > > echo 21,99,249 > /sys/bus/ap/apmask > > The problem with this is how do we indicate that all bits are to be > cleared? Do you > have any suggestions for a reasonable user syntax? Perhaps an empty write, just a carriage return, "null". I'd shy away from magic digits, like 256 to clear. > > ... > >> + * mdev_attr_groups > >> + This attribute group identifies the user-defined sysfs attributes of the > >> + mediated device. When a device is registered with the VFIO mediated device > >> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' > >> + structure will be created in the mediated matrix device's directory. The > >> + sysfs attributes for a mediated matrix device are: > >> + * assign_adapter: > >> + A write-only file for assigning an AP adapter to the mediated matrix > >> + device. To assign an adapter, the APID of the adapter is written to the > >> + file. > >> + * assign_domain: > >> + A write-only file for assigning an AP usage domain to the mediated matrix > >> + device. To assign a domain, the APQI of the AP queue corresponding to a > >> + usage domain is written to the file. > >> + * matrix: > >> + A read-only file for displaying the APQNs derived from the adapters and > >> + domains assigned to the mediated matrix device. > >> + * assign_control_domain: > >> + A write-only file for assigning an AP control domain to the mediated > >> + matrix device. To assign a control domain, the ID of a domain to be > >> + controlled is written to the file. For the initial implementation, the set > >> + of control domains will always include the set of usage domains, so it is > >> + only necessary to assign control domains that are not also assigned as > >> + usage domains. > > > > How will the user know when this changes? What's the transition plan > > to maintain compatibility when it does change? > > I'm sorry, I don't understand the question. When you say user, about > whom are > you talking; the person doing the assignments? What changes are you talking > about? The user would be the person interaction with the sysfs attribute assign_control_domain where it states "[f]or the initial implementation, the set of control domains will always include the set of usage domains, so it is only necessary to assign control domains that are not also assigned as usage domains." I infer from the usage of "initial implementation" that this behavior may not always be the case and therefore if it's not final what is the transition plan so that a user interacting with this attribute can know the current behavior. > > ... > >> +4. The administrator now needs to configure the matrixes for mediated > >> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). > >> + > >> + This is how the matrix is configured for Guest1: > >> + > >> + echo 5 > assign_adapter > >> + echo 6 > assign_adapter > >> + echo 4 > assign_domain > >> + echo 0xab > assign_domain > >> + > >> + For this implementation, all usage domains - i.e., domains assigned > >> + via the assign_domain attribute file - will also be configured in the ADM > >> + field of the KVM guest's CRYCB, so there is no need to assign control > >> + domains here unless you want to assign control domains that are not > >> + assigned as usage domains. > >> + > >> + If a mistake is made configuring an adapter, domain or control domain, > >> + you can use the unassign_xxx files to unassign the adapter, domain or > >> + control domain. > > Would it be more consistent with other sysfs entries to call these > > "bind" and "unbind" rather than "assign" and "unassign > > Aren't the bind and unbind sysfs interfaces typically associated with > binding devices to and unbinding devices from a device driver? Yes > In this > case we are talking about configuring a mediated device's AP matrix - i.e., > setting bits identifying the adapters, domains and control domains defined > for the mediated device (i.e., a guest). Maybe there are better words we > could use than assign/unassign, but I find the use of bind/unbind to be > confusing. I'm open to suggestions. If we remove the 'activate' interface and dedicate resources to the mdev device at the point where we create an intersection in the matrix, wouldn't that resource be "bound" to the mdev at that time? This is not a strong issue for me, I just thought "assign" seemed inconsistent as the operation felt more like a traditional "bind". > >> + > >> + To display the matrix configuration for Guest1: > >> + > >> + cat matrix > >> + > >> + This is how the matrix is configured for Guest2: > >> + > >> + echo 5 > assign_adapter > >> + echo 0x47 > assign_domain > >> + echo 0xff > assign_domain > >> + > >> +5. The adminstrator now needs to activate the mediated devices. > >> + echo 1 > activate > > Or optionally not. As in reply to cover letter, I don't think this > > interface is sufficiently justified, or necessarily desirable. > > To clarify your objection, let me state what I think you are saying. > > 1. You do not think there is a good reason to assign duplicate APQNs > - i.e., the cross product of all adapter IDs and domain IDs assigned > to the mdev. I think the cross product of adapters and domains is a basic part of the design here, what I object to is that mdevs can exist with overlapping cross products and which one can be activated is dependent on activation ordering. For example, if I have the following mdevs: {adapters}{domains} A: {0,1}{0,1} B: {0}{0} C: {0}{1} D: {1}{0} E: {1}{1} Who is supposed to understand and debug that A cannot be activated while any of B/C/D/E are activated? What does the debugging process look like? If the mdev is not explicitly activated, only opened, how do we determine the fault is from a resource conflict on the mdev definition? On the other hand if the cross product is committed at the time it's specified, then a scenario where E already exists and we're trying to create A might look like: 1. Specify adapter mask of {0,1} 2. Specify domain mask of {0,1} <-- write(2) fails with -EBUSY Here we know that one or more of the domains we've specified are not available on the set of adapters we've configured. > 2. All mdevs should be validly configured for use regardless of when > guests using them are started. In other words, libvirt should be > able to depend on the fact that an mdev is ready for use at all > times. Yes, an mdev should represent committed resources. We support dynamic creation and removal if the resources need to be freed for the host or another mdev configuration. > I will discuss this with the team when we meet on Thursday and get > back to you. If I am mistaken in my analysis of your concern, please > clarify. > > I would like to mention that when a guest using an mdev starts, > the mdev will be activated if not already activated. The guest > will not start if another activated mdev is using an APQN assigned > to the mdev of the guest being started. Understood. > >> +6. Start Guest1: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... > >> + > >> +7. Start Guest2: > >> + > >> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ > >> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... > >> + > >> +When the guest is shut down, the mediated matrix device may be removed. > >> + > >> +Using our example again, to remove the mediated matrix device $uuid1: > >> + > >> + /sys/devices/virtual/misc > >> + --- [vfio_ap] > >> + --------- [mdev_supported_types] > >> + ------------ [vfio_ap-passthrough] > >> + --------------- [devices] > >> + ------------------ [$uuid1] > >> + --------------------- activate > >> + --------------------- remove > >> + > >> + > >> + echo 0 > activate > >> + echo 1 > remove > >> + > >> + This will release all the AP queues for the mediated device and > >> + remove all of the mdev matrix device's sysfs structures. To > >> + recreate and reconfigure the mdev matrix device, all of the steps starting > >> + with step 4 will have to be performed again. > >> + > >> + It is not necessary to remove an mdev matrix device, but one may want to > >> + remove it if no guest will use it during the lifetime of the linux host. If > >> + the mdev matrix device is removed, one may want to unbind the AP queues the > >> + guest was using from the vfio_ap device driver and bind them back to the > >> + default driver. Alternatively, the AP queues can be configured for another > >> + mdev matrix (i.e., guest). In either case, one must take care to change the > >> + secure key configured for the domain to which the queue is connected. > > This seems prime for data leakage, extremely sensitive data at that. > > Who's responsibility is it to prevent this? Shouldn't closing the > > device reset the device, which should perhaps wipe any key > > configuration? > > The device is zeroized (reset) when it is removed. The zeroize > instruction - to > the best of my knowledge - does nothing with the secure key for a given > queue. > There is also no instruction that I know of to clear keys, so it would > appear > that this is left to the system administrator. :-o That seems like a pretty serious fault. Given that this interface is specifically for crypto devices where such sensitive information seems common, doesn't it seem like the kernel, ie. the mdev vendor driver, should provide guarantees that should your user instance crash or be killed that keys are not available to the system admin? > >> + > >> + > >> +Limitations > >> +=========== > >> +An admin needs to be careful when writing to sysfs files > >> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver > >> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap > >> +driver and assigned to a mediated device, admin should not re-assign the > >> +AP queues back to the default driver, because of technical limitations. > >> +The kernel does not prevent you from re-assigning from AP queues reserved > >> +for the vfio_ap driver back to default driver. Future enhancements in > >> +the ap_bus and vfio_ap are likely to introduce complete support for such > >> +dynamic reconfiguration. But until then be extremely careful. > > Why don't we have these enhancements now? Should the whole thing be > > marked experimental without them? This sounds similar to a vfio-pci > > case where a group with multiple devices could have device which is > > unused by the user unbound from vfio-pci and re-bound to a native host > > driver. We BUG_ON when this occurs to protect the data. > > We have customers that have been waiting a long time and are very anxious > to have guest access to AP devices. In the interest of reducing time to > market, we have decided to simplify the initial implementation as much > as possible. I wouldn't necessarily call this experimental, but view it > more as the minimal viable product. > > As far as BUG_ON, I will discuss this with the crypto team at our > meeting on Thursday and get back to you. A minimum viable product still needs to protect the host and does not excuse us from later breaking the user exposed ABI. > > Probably also need to evaluate updates to > > Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is > > supposed to interact with an 'activate' attribute. Thanks, > > As it stands, libvirt will not create or modify the attributes of an mdev. > If a guest is started under libvirt and its mdev has not been activated, > it will be activated when the mdev fd is opened. Of course, if another > activated mdev is assigned an APQN assigned to the mdev of the guest > being started, the guest will be terminated. Exactly why I disapprove of the activated interface, but it's being proposed and recommended for use as a standard part of the lifecycle management for these devices. Thanks, Alex
On 08/01/2018 02:08 PM, Alex Williamson wrote: > On Wed, 1 Aug 2018 13:12:05 +0200 > Halil Pasic <pasic@linux.ibm.com> wrote: > >> On 07/27/2018 07:44 PM, Alex Williamson wrote: >>> On Thu, 26 Jul 2018 21:54:29 +0200 >>> Christian Borntraeger <borntraeger@de.ibm.com> wrote: >>> ... >>>> +The process for reserving an AP queue for use by a KVM guest is: >>>> + >>>> +* The vfio-ap driver during its initialization will perform the following: >>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >>>> + * Create the 'matrix' device in the 'vfio_ap' root >>>> + * Register the matrix device with the device core >>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >>>> + newer) and to provide the vfio_ap driver's probe and remove callback >>>> + interfaces. The reason why older devices are not supported is because there >>>> + are no systems available on which to test. >>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >>>> + mask to /sys/bus/ap/aqmask. >>>> + >>>> + For example to reserve all the AP Queues on the system for vfio_ap driver: >>>> + >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask >>> And this is a reasonable user syntax? ;) >>> >> We intend to provide tooling that makes this more human friendly. BTW >> echo 0 > sys/bus/ap/apmask would also work -- and looks less scary. > If tooling is required, perhaps the interface is wrong by design. > Maybe the interface should accept a list of bits, for example: > > echo 255,0 > /sys/bus/ap/aqmask > > vs > > echo 0x8000000000000000000000000000000000000000000000000000000000000001 > /sys/bus/ap/aqmask > >>> ... >>>> + * mdev_attr_groups >>>> + This attribute group identifies the user-defined sysfs attributes of the >>>> + mediated device. When a device is registered with the VFIO mediated device >>>> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' >>>> + structure will be created in the mediated matrix device's directory. The >>>> + sysfs attributes for a mediated matrix device are: >>>> + * assign_adapter: >>>> + A write-only file for assigning an AP adapter to the mediated matrix >>>> + device. To assign an adapter, the APID of the adapter is written to the >>>> + file. >>>> + * assign_domain: >>>> + A write-only file for assigning an AP usage domain to the mediated matrix >>>> + device. To assign a domain, the APQI of the AP queue corresponding to a >>>> + usage domain is written to the file. >>>> + * matrix: >>>> + A read-only file for displaying the APQNs derived from the adapters and >>>> + domains assigned to the mediated matrix device. >>>> + * assign_control_domain: >>>> + A write-only file for assigning an AP control domain to the mediated >>>> + matrix device. To assign a control domain, the ID of a domain to be >>>> + controlled is written to the file. For the initial implementation, the set >>>> + of control domains will always include the set of usage domains, so it is >>>> + only necessary to assign control domains that are not also assigned as >>>> + usage domains. >>> >>> How will the user know when this changes? What's the transition plan >>> to maintain compatibility when it does change? >>> >> I don't think this is supposed to change under the users feet. This document >> probably started out as a design document. While I'm personally in favor of >> managing the two separate, I think we need to commit to either-or and >> get rid of 'For the initial implementation,'. > +1 > >>> ... >>>> +4. The administrator now needs to configure the matrixes for mediated >>>> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). >>>> + >>>> + This is how the matrix is configured for Guest1: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 6 > assign_adapter >>>> + echo 4 > assign_domain >>>> + echo 0xab > assign_domain >>>> + >>>> + For this implementation, all usage domains - i.e., domains assigned >>>> + via the assign_domain attribute file - will also be configured in the ADM >>>> + field of the KVM guest's CRYCB, so there is no need to assign control >>>> + domains here unless you want to assign control domains that are not >>>> + assigned as usage domains. >>>> + >>>> + If a mistake is made configuring an adapter, domain or control domain, >>>> + you can use the unassign_xxx files to unassign the adapter, domain or >>>> + control domain. >>> Would it be more consistent with other sysfs entries to call these >>> "bind" and "unbind" rather than "assign" and "unassign"? >>> >> I'm not sure. Let me recap what is this about. First we partition our AP >> resources (queues) into 'to be used by the host' and 'to be used by the >> guests' using the /sys/bus/ap/a[pq]mask. Then we sub-partition 'to be >> used by the guests' among the guests. Each vfio ap_matrix represents a >> partition that corresponds to a single VM (guest). Such a partition can >> be seen as a device that grants access (usage or administration) to a set >> of queues. We can grant access to queues even if these are not currently >> available in the system. E.g. we can partition an AP card that is >> currently not online or even plugged. When The resources become available >> (e.g. new card plugged, or just 'configured'), the resources (should) >> become available at all levels if the authorization is right. I know >> 'bind' binding a device (that currently exists within the system) to a >> driver. The vfio_ap stuff seems different enough to seek a different name >> for it. > As discussed elsewhere, I'm not in favor of allowing mdev devices to > exist which rely on resources that are not present or in-use > elsewhere. If we remove that concept, then we are actually binding and > unbinding resources to the mdev device. > >>>> + >>>> + To display the matrix configuration for Guest1: >>>> + >>>> + cat matrix >>>> + >>>> + This is how the matrix is configured for Guest2: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 0x47 > assign_domain >>>> + echo 0xff > assign_domain >>>> + >>>> +5. The adminstrator now needs to activate the mediated devices. >>>> + echo 1 > activate >>> Or optionally not. As in reply to cover letter, I don't think this >>> interface is sufficiently justified, or necessarily desirable. >>> >> You are right. Activation is not necessary because the open will try to >> auto-activate it. But honestly, I would recommend activating it in >> advance, so the admin can be sure the resources are claimed. > This can also be solved by claiming the resource at the point at which > it is bound/assigned/attached to the mdev device (ie. at any point > where we introduce new intersections in the matrix). > >> In our current design we choose to give the administrator the freedom, to >> decide on his policy how does he want to manage the vfio_ap mdevs >> lifecycle. And to decide what guarantees does he want to have at a given >> point in time. With 'activate' both 'dynamically and on-demand of >> starting a VM' and 'create known vfio_ap mdevs at system bring up' are >> viable. If there were a consensus on how the life-cycle of the mdev >> devices is to be managed, it would be easier to not give this freedom to >> the admin. > Please, this is not a freedom of the admin issue, this simply defines > the point in time at which bound/assigned/attached resources are > committed to the mdev device. In the model proposed here, the activate > attribute allows resources to be over-committed, placing the burden on > the management tools to implicitly understand which mdev devices can be > used simultaneously. With the existing model, resources are committed > on creation. I'll agree that such a model doesn't necessarily apply to > these new devices, but I'd argue the next logical step is that > resources are committed as they're attached such that the mdev device > is always in a usable state. AFAICT, the same "freedom" is available > to the admin by simply scripting mdev creation and assembly without a > change to the mdev interface. > >>>> +6. Start Guest1: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... >>>> + >>>> +7. Start Guest2: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... >>>> + >>>> +When the guest is shut down, the mediated matrix device may be removed. >>>> + >>>> +Using our example again, to remove the mediated matrix device $uuid1: >>>> + >>>> + /sys/devices/virtual/misc >>>> + --- [vfio_ap] >>>> + --------- [mdev_supported_types] >>>> + ------------ [vfio_ap-passthrough] >>>> + --------------- [devices] >>>> + ------------------ [$uuid1] >>>> + --------------------- activate >>>> + --------------------- remove >>>> + >>>> + >>>> + echo 0 > activate >>>> + echo 1 > remove >>>> + >>>> + This will release all the AP queues for the mediated device and >>>> + remove all of the mdev matrix device's sysfs structures. To >>>> + recreate and reconfigure the mdev matrix device, all of the steps starting >>>> + with step 4 will have to be performed again. >>>> + >>>> + It is not necessary to remove an mdev matrix device, but one may want to >>>> + remove it if no guest will use it during the lifetime of the linux host. If >>>> + the mdev matrix device is removed, one may want to unbind the AP queues the >>>> + guest was using from the vfio_ap device driver and bind them back to the >>>> + default driver. Alternatively, the AP queues can be configured for another >>>> + mdev matrix (i.e., guest). In either case, one must take care to change the >>>> + secure key configured for the domain to which the queue is connected. >>> This seems prime for data leakage, extremely sensitive data at that. >>> Who's responsibility is it to prevent this? Shouldn't closing the >>> device reset the device, which should perhaps wipe any key >>> configuration? >>> >> The documentation seems to be a bit outdated here. We do reset the device >> in the release callback and on VFIO_DEVICE_RESET ioctl. There might be some >> stuff that is not supposed to be reset by us, but that is fine (master keys >> managed via special infrastructure called the TKE). I'm not sure what is the >> meaning of last sentence of the paragraph in question ('change the secure >> key configured ...') though. Anyway we will have to update this. >> >>>> + >>>> + >>>> +Limitations >>>> +=========== >>>> +An admin needs to be careful when writing to sysfs files >>>> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver >>>> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap >>>> +driver and assigned to a mediated device, admin should not re-assign the >>>> +AP queues back to the default driver, because of technical limitations. >>>> +The kernel does not prevent you from re-assigning from AP queues reserved >>>> +for the vfio_ap driver back to default driver. Future enhancements in >>>> +the ap_bus and vfio_ap are likely to introduce complete support for such >>>> +dynamic reconfiguration. But until then be extremely careful. >>> Why don't we have these enhancements now? Should the whole thing be >>> marked experimental without them? This sounds similar to a vfio-pci >>> case where a group with multiple devices could have device which is >>> unused by the user unbound from vfio-pci and re-bound to a native host >>> driver. We BUG_ON when this occurs to protect the data. >> I think, marking vfio_ap as experimental is probably a good idea. I think >> we can do these enhancements later. Of course, having everything right away >> is the best, but the problem with that is that it takes (more) time. >> >>> Probably also need to evaluate updates to >>> Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is >>> supposed to interact with an 'activate' attribute. Thanks, >>> >> I'm not sure documenting the 'activate' in Documentation/ABI/testing/sysfs-bus-vfio-mdev >> is necessary. I see it more like the other type specific attributes. For >> instance drivers/gpu/drm/i915/gvt/kvmgt.c seems to have vgpu_id and >> https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-red-hat-el-kvm >> suggests that Nvidia vgpus have a vgpu_params (read-write) attribute to >> 'control the behavior of the vGPU, such as the frame rate limiter...'. > I disagree, the difference is that the existing vendor attributes can > modify the behavior of the mdev device, but are not part of the > standard mdev sysfs ABI. You recommend above that 'activate' should be > used prior to open(2) in order to commit mdev resources. That's a > fundamental change in the management of the device, which is only > obscured by the fact that open(2) does an implicit activate. Let's > please just drop this whole notion of 'activate'. You'll be happy to know the team has decided to drop the activate. > >> AFAIK the activate and the assign/unassign attributes are because we can >> not do much with homogeneous vfio_ap mdevs. The only safe default for >> a vfio_ap mdev is all zero masks, that can't use any resources, that >> is essentially useless. And this is how we construct the vfio_ap mdev >> devices. In that sense we could as well construct them 'activated', >> as empty won't result in any kind of conflicts (and allow the admin >> to de-activate these so we are still flexible about the lifecycle >> management). > I agree that a zero mask is the only safe default for creation and that > vendor specific attributes are necessary to build the configuration up > to something that provides some usable value (unless we decide to > expand the create interface). As above though, I disagree about what > 'activate' does or does not provide. I don't see that it adds and > flexibility in lifecycle management, at least not in a way that's > compatible with existing mdev devices or in a way to facilitates upper > level software management. It seems to only allow the admin to avoid > dynamically creating and attaching resources to mdevs, at the cost of > manage layers needing to account for over-commitment. Thanks, > > Alex >
On 08/02/2018 11:25 AM, Alex Williamson wrote: > On Wed, 1 Aug 2018 19:05:35 -0400 > Tony Krowiak <akrowiak@linux.ibm.com> wrote: > >> On 07/27/2018 01:44 PM, Alex Williamson wrote: >>> On Thu, 26 Jul 2018 21:54:29 +0200 >>> Christian Borntraeger <borntraeger@de.ibm.com> wrote: >>> ... >>>> +The process for reserving an AP queue for use by a KVM guest is: >>>> + >>>> +* The vfio-ap driver during its initialization will perform the following: >>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >>>> + * Create the 'matrix' device in the 'vfio_ap' root >>>> + * Register the matrix device with the device core >>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >>>> + newer) and to provide the vfio_ap driver's probe and remove callback >>>> + interfaces. The reason why older devices are not supported is because there >>>> + are no systems available on which to test. >>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >>>> + mask to /sys/bus/ap/aqmask. >>>> + >>>> + For example to reserve all the AP Queues on the system for vfio_ap driver: >>>> + >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask >>> And this is a reasonable user syntax? ;) >> I am not a fan of writing a 256-bit bitmask as such. It is entirely too >> difficult to >> figure out which character to specify to indicate a bit number to be set >> for anything >> past the first few characters. Of course, tooling could be used to >> accomplish this >> task, but the sysfs interfaces should probably be user-friendly. A >> comma-separated >> list could be used; for example to set reserve adapters 21, 99 and 249: >> >> echo 21,99,249 > /sys/bus/ap/apmask >> >> The problem with this is how do we indicate that all bits are to be >> cleared? Do you >> have any suggestions for a reasonable user syntax? > Perhaps an empty write, just a carriage return, "null". I'd shy away > from magic digits, like 256 to clear. The crypto team has decided to eliminate these sysfs attributes for the AP bus and rely on bind/unbind. > >>> ... >>>> + * mdev_attr_groups >>>> + This attribute group identifies the user-defined sysfs attributes of the >>>> + mediated device. When a device is registered with the VFIO mediated device >>>> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' >>>> + structure will be created in the mediated matrix device's directory. The >>>> + sysfs attributes for a mediated matrix device are: >>>> + * assign_adapter: >>>> + A write-only file for assigning an AP adapter to the mediated matrix >>>> + device. To assign an adapter, the APID of the adapter is written to the >>>> + file. >>>> + * assign_domain: >>>> + A write-only file for assigning an AP usage domain to the mediated matrix >>>> + device. To assign a domain, the APQI of the AP queue corresponding to a >>>> + usage domain is written to the file. >>>> + * matrix: >>>> + A read-only file for displaying the APQNs derived from the adapters and >>>> + domains assigned to the mediated matrix device. >>>> + * assign_control_domain: >>>> + A write-only file for assigning an AP control domain to the mediated >>>> + matrix device. To assign a control domain, the ID of a domain to be >>>> + controlled is written to the file. For the initial implementation, the set >>>> + of control domains will always include the set of usage domains, so it is >>>> + only necessary to assign control domains that are not also assigned as >>>> + usage domains. >>> How will the user know when this changes? What's the transition plan >>> to maintain compatibility when it does change? >> I'm sorry, I don't understand the question. When you say user, about >> whom are >> you talking; the person doing the assignments? What changes are you talking >> about? > The user would be the person interaction with the sysfs attribute > assign_control_domain where it states "[f]or the initial implementation, > the set of control domains will always include the set of usage > domains, so it is only necessary to assign control domains that are not > also assigned as usage domains." I infer from the usage of "initial > implementation" that this behavior may not always be the case and > therefore if it's not final what is the transition plan so that a user > interacting with this attribute can know the current behavior. Ah, I get your point now. I see how this verbiage would lead you to that conclusion and it will be removed. This behavior will not change. > >>> ... >>>> +4. The administrator now needs to configure the matrixes for mediated >>>> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). >>>> + >>>> + This is how the matrix is configured for Guest1: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 6 > assign_adapter >>>> + echo 4 > assign_domain >>>> + echo 0xab > assign_domain >>>> + >>>> + For this implementation, all usage domains - i.e., domains assigned >>>> + via the assign_domain attribute file - will also be configured in the ADM >>>> + field of the KVM guest's CRYCB, so there is no need to assign control >>>> + domains here unless you want to assign control domains that are not >>>> + assigned as usage domains. >>>> + >>>> + If a mistake is made configuring an adapter, domain or control domain, >>>> + you can use the unassign_xxx files to unassign the adapter, domain or >>>> + control domain. >>> Would it be more consistent with other sysfs entries to call these >>> "bind" and "unbind" rather than "assign" and "unassign >> Aren't the bind and unbind sysfs interfaces typically associated with >> binding devices to and unbinding devices from a device driver? > Yes > >> In this >> case we are talking about configuring a mediated device's AP matrix - i.e., >> setting bits identifying the adapters, domains and control domains defined >> for the mediated device (i.e., a guest). Maybe there are better words we >> could use than assign/unassign, but I find the use of bind/unbind to be >> confusing. I'm open to suggestions. > If we remove the 'activate' interface and dedicate resources to the > mdev device at the point where we create an intersection in the matrix, > wouldn't that resource be "bound" to the mdev at that time? This is > not a strong issue for me, I just thought "assign" seemed inconsistent > as the operation felt more like a traditional "bind". We are going to remove the activate for the next series, but stick with the assign terminology because we're not really binding devices, but constructing a matrix that identifies the AP queues to which the guest using the mdev will have access. > >>>> + >>>> + To display the matrix configuration for Guest1: >>>> + >>>> + cat matrix >>>> + >>>> + This is how the matrix is configured for Guest2: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 0x47 > assign_domain >>>> + echo 0xff > assign_domain >>>> + >>>> +5. The adminstrator now needs to activate the mediated devices. >>>> + echo 1 > activate >>> Or optionally not. As in reply to cover letter, I don't think this >>> interface is sufficiently justified, or necessarily desirable. >> To clarify your objection, let me state what I think you are saying. >> >> 1. You do not think there is a good reason to assign duplicate APQNs >> - i.e., the cross product of all adapter IDs and domain IDs assigned >> to the mdev. > I think the cross product of adapters and domains is a basic part of > the design here, what I object to is that mdevs can exist with > overlapping cross products and which one can be activated is dependent > on activation ordering. For example, if I have the following mdevs: > > {adapters}{domains} > A: {0,1}{0,1} > B: {0}{0} > C: {0}{1} > D: {1}{0} > E: {1}{1} > > Who is supposed to understand and debug that A cannot be activated > while any of B/C/D/E are activated? What does the debugging process > look like? If the mdev is not explicitly activated, only opened, how > do we determine the fault is from a resource conflict on the mdev > definition? On the other hand if the cross product is committed at the > time it's specified, then a scenario where E already exists and we're > trying to create A might look like: > > 1. Specify adapter mask of {0,1} > 2. Specify domain mask of {0,1} <-- write(2) fails with -EBUSY > > Here we know that one or more of the domains we've specified are not > available on the set of adapters we've configured. This is moot for the next series because the activate will be removed and consistency checking will be done as adapters/domains are assigned. > >> 2. All mdevs should be validly configured for use regardless of when >> guests using them are started. In other words, libvirt should be >> able to depend on the fact that an mdev is ready for use at all >> times. > Yes, an mdev should represent committed resources. We support dynamic > creation and removal if the resources need to be freed for the host or > another mdev configuration. See my previous response. > >> I will discuss this with the team when we meet on Thursday and get >> back to you. If I am mistaken in my analysis of your concern, please >> clarify. >> >> I would like to mention that when a guest using an mdev starts, >> the mdev will be activated if not already activated. The guest >> will not start if another activated mdev is using an APQN assigned >> to the mdev of the guest being started. > Understood. > >>>> +6. Start Guest1: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... >>>> + >>>> +7. Start Guest2: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... >>>> + >>>> +When the guest is shut down, the mediated matrix device may be removed. >>>> + >>>> +Using our example again, to remove the mediated matrix device $uuid1: >>>> + >>>> + /sys/devices/virtual/misc >>>> + --- [vfio_ap] >>>> + --------- [mdev_supported_types] >>>> + ------------ [vfio_ap-passthrough] >>>> + --------------- [devices] >>>> + ------------------ [$uuid1] >>>> + --------------------- activate >>>> + --------------------- remove >>>> + >>>> + >>>> + echo 0 > activate >>>> + echo 1 > remove >>>> + >>>> + This will release all the AP queues for the mediated device and >>>> + remove all of the mdev matrix device's sysfs structures. To >>>> + recreate and reconfigure the mdev matrix device, all of the steps starting >>>> + with step 4 will have to be performed again. >>>> + >>>> + It is not necessary to remove an mdev matrix device, but one may want to >>>> + remove it if no guest will use it during the lifetime of the linux host. If >>>> + the mdev matrix device is removed, one may want to unbind the AP queues the >>>> + guest was using from the vfio_ap device driver and bind them back to the >>>> + default driver. Alternatively, the AP queues can be configured for another >>>> + mdev matrix (i.e., guest). In either case, one must take care to change the >>>> + secure key configured for the domain to which the queue is connected. >>> This seems prime for data leakage, extremely sensitive data at that. >>> Who's responsibility is it to prevent this? Shouldn't closing the >>> device reset the device, which should perhaps wipe any key >>> configuration? >> The device is zeroized (reset) when it is removed. The zeroize >> instruction - to >> the best of my knowledge - does nothing with the secure key for a given >> queue. >> There is also no instruction that I know of to clear keys, so it would >> appear >> that this is left to the system administrator. > :-o That seems like a pretty serious fault. Given that this interface > is specifically for crypto devices where such sensitive information > seems common, doesn't it seem like the kernel, ie. the mdev vendor > driver, should provide guarantees that should your user instance crash > or be killed that keys are not available to the system admin? I may have misspoke with regard to my understanding of the AP reset. I've been told that the keys stored with the domain will be reset. I will verify this and get back to you. > >>>> + >>>> + >>>> +Limitations >>>> +=========== >>>> +An admin needs to be careful when writing to sysfs files >>>> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver >>>> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap >>>> +driver and assigned to a mediated device, admin should not re-assign the >>>> +AP queues back to the default driver, because of technical limitations. >>>> +The kernel does not prevent you from re-assigning from AP queues reserved >>>> +for the vfio_ap driver back to default driver. Future enhancements in >>>> +the ap_bus and vfio_ap are likely to introduce complete support for such >>>> +dynamic reconfiguration. But until then be extremely careful. >>> Why don't we have these enhancements now? Should the whole thing be >>> marked experimental without them? This sounds similar to a vfio-pci >>> case where a group with multiple devices could have device which is >>> unused by the user unbound from vfio-pci and re-bound to a native host >>> driver. We BUG_ON when this occurs to protect the data. >> We have customers that have been waiting a long time and are very anxious >> to have guest access to AP devices. In the interest of reducing time to >> market, we have decided to simplify the initial implementation as much >> as possible. I wouldn't necessarily call this experimental, but view it >> more as the minimal viable product. >> >> As far as BUG_ON, I will discuss this with the crypto team at our >> meeting on Thursday and get back to you. > A minimum viable product still needs to protect the host and does not > excuse us from later breaking the user exposed ABI. I will have a more thorough look into it. > >>> Probably also need to evaluate updates to >>> Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is >>> supposed to interact with an 'activate' attribute. Thanks, >> As it stands, libvirt will not create or modify the attributes of an mdev. >> If a guest is started under libvirt and its mdev has not been activated, >> it will be activated when the mdev fd is opened. Of course, if another >> activated mdev is assigned an APQN assigned to the mdev of the guest >> being started, the guest will be terminated. > Exactly why I disapprove of the activated interface, but it's being > proposed and recommended for use as a standard part of the lifecycle > management for these devices. Thanks, As I stated elsewhere, we are removing the activate interface. All consistency checking will be done as adapters/domains are assigned to the mdev. > > Alex >
diff --git a/Documentation/s390/vfio-ap.txt b/Documentation/s390/vfio-ap.txt new file mode 100644 index 000000000000..e7cda892817e --- /dev/null +++ b/Documentation/s390/vfio-ap.txt @@ -0,0 +1,591 @@ +Introduction: +============ +The Adjunct Processor (AP) facility is an IBM Z cryptographic facility comprised +of three AP instructions and from 1 up to 256 PCIe cryptographic adapter cards. +The AP devices provide cryptographic functions to all CPUs assigned to a +linux system running in an IBM Z system LPAR. + +The AP adapter cards are exposed via the AP bus. The motivation for vfio-ap +is to make AP cards available to KVM guests using the VFIO mediated device +framework. This implementation relies considerably on the s390 virtualization +facilities which do most of the hard work of providing direct access to AP +devices. + +AP Architectural Overview: +========================= +To facilitate the comprehension of the design, let's start with some +definitions: + +* AP adapter + + An AP adapter is an IBM Z adapter card that can perform cryptographic + functions. There can be from 0 to 256 adapters assigned to an LPAR. Adapters + assigned to the LPAR in which a linux host is running will be available to + the linux host. Each adapter is identified by a number from 0 to 255. When + installed, an AP adapter is accessed by AP instructions executed by any CPU. + + The AP adapter cards are assigned to a given LPAR via the system's Activation + Profile which can be edited via the HMC. When the system is IPL'd, the AP bus + module is loaded and detects the AP adapter cards assigned to the LPAR. The AP + bus creates a sysfs device for each adapter as they are detected. For example, + if AP adapters 4 and 10 (0x0a) are assigned to the LPAR, the AP bus will + create the following sysfs entries: + + /sys/devices/ap/card04 + /sys/devices/ap/card0a + + Symbolic links to these devices will also be created in the AP bus devices + sub-directory: + + /sys/bus/ap/devices/[card04] + /sys/bus/ap/devices/[card04] + +* AP domain + + An adapter is partitioned into domains. Each domain can be thought of as + a set of hardware registers for processing AP instructions. An adapter can + hold up to 256 domains. Each domain is identified by a number from 0 to 255. + Domains can be further classified into two types: + + * Usage domains are domains that can be accessed directly to process AP + commands. + + * Control domains are domains that are accessed indirectly by AP + commands sent to a usage domain to control or change the domain, for + example; to set a secure private key for the domain. + + The AP usage and control domains are assigned to a given LPAR via the system's + Activation Profile which can be edited via the HMC. When the system is IPL'd, + the AP bus module is loaded and detects the AP usage and control domains + assigned to the LPAR. The domain number of each usage domain will be coupled + with the adapter number of each AP adapter assigned to the LPAR to identify + the AP queues (see AP Queue section below). The domain number of each control + domain will be represented in a bitmask and stored in a sysfs file + /sys/bus/ap/ap_control_domain_mask created by the bus. The bits in the mask, + from most to least significant bit, correspond to domains 0-255. + + A domain may be assigned to a system as both a usage and control domain, or + as a control domain only. Consequently, all domains assigned as both a usage + and control domain can both process AP commands as well as be changed by an AP + command sent to any usage domain assigned to the same system. Domains assigned + only as control domains can not process AP commands but can be changed by AP + commands sent to any usage domain assigned to the system. + +* AP Queue + + An AP queue is the means by which an AP command-request message is sent to a + usage domain inside a specific adapter. An AP queue is identified by a tuple + comprised of an AP adapter ID (APID) and an AP queue index (APQI). The + APQI corresponds to a given usage domain number within the adapter. This tuple + forms an AP Queue Number (APQN) uniquely identifying an AP queue. AP + instructions include a field containing the APQN to identify the AP queue to + which the AP command-request message is to be sent for processing. + + The AP bus will create a sysfs device for each APQN that can be derived from + the intersection of the AP adapter and usage domain numbers detected when the + AP bus module is loaded. For example, if adapters 4 and 10 (0x0a) and usage + domains 6 and 71 (0x47) are assigned to the LPAR, the AP bus will create the + following sysfs entries: + + /sys/devices/ap/card04/04.0006 + /sys/devices/ap/card04/04.0047 + /sys/devices/ap/card0a/0a.0006 + /sys/devices/ap/card0a/0a.0047 + + The following symbolic links to these devices will be created in the AP bus + devices subdirectory: + + /sys/bus/ap/devices/[04.0006] + /sys/bus/ap/devices/[04.0047] + /sys/bus/ap/devices/[0a.0006] + /sys/bus/ap/devices/[0a.0047] + +* AP Instructions: + + There are three AP instructions: + + * NQAP: to enqueue an AP command-request message to a queue + * DQAP: to dequeue an AP command-reply message from a queue + * PQAP: to administer the queues + +AP and SIE: +========== +Let's now see how AP instructions are interpreted by the hardware. + +A satellite control block called the Crypto Control Block is attached to our +main hardware virtualization control block. The CRYCB contains three fields to +identify the adapters, usage domains and control domains assigned to the KVM +guest: + +* The AP Mask (APM) field is a bit mask that identifies the AP adapters assigned + to the KVM guest. Each bit in the mask, from most significant to least + significant bit, corresponds to an APID from 0-255. If a bit is set, the + corresponding adapter is valid for use by the KVM guest. + +* The AP Queue Mask (AQM) field is a bit mask identifying the AP usage domains + assigned to the KVM guest. Each bit in the mask, from most significant to + least significant bit, corresponds to an AP queue index (APQI) from 0-255. If + a bit is set, the corresponding queue is valid for use by the KVM guest. + +* The AP Domain Mask field is a bit mask that identifies the AP control domains + assigned to the KVM guest. The ADM bit mask controls which domains can be + changed by an AP command-request message sent to a usage domain from the + guest. Each bit in the mask, from least significant to most significant bit, + corresponds to a domain from 0-255. If a bit is set, the corresponding domain + can be modified by an AP command-request message sent to a usage domain + configured for the KVM guest. + +If you recall from the description of an AP Queue, AP instructions include +an APQN to identify the AP adapter and AP queue to which an AP command-request +message is to be sent (NQAP and PQAP instructions), or from which a +command-reply message is to be received (DQAP instruction). The validity of an +APQN is defined by the matrix calculated from the APM and AQM; it is the +cross product of all assigned adapter numbers (APM) with all assigned queue +indexes (AQM). For example, if adapters 1 and 2 and usage domains 5 and 6 are +assigned to a guest, the APQNs (1,5), (1,6), (2,5) and (2,6) will be valid for +the guest. + +The APQNs can provide secure key functionality - i.e., a private key is stored +on the adapter card for each of its domains - so each APQN must be assigned to +at most one guest or the linux host. + + Example 1: Valid configuration: + ------------------------------ + Guest1: adapters 1,2 domains 5,6 + Guest2: adapter 1,2 domain 7 + + This is valid because both guests have a unique set of APQNs: Guest1 has + APQNs (1,5), (1,6), (2,5) and (2,6); Guest2 has APQNs (1,7) and (2,7). + + Example 2: Invalid configuration: + --------------------------------is assigned by writing the adapter's number into the + Guest1: adapters 1,2 domains 5,6 + Guest2: adapter 1 domains 6,7 + + This is an invalid configuration because both guests have access to + APQN (1,6). + +The Design: +=========== +The design introduces three new objects: + +1. AP matrix device +2. VFIO AP device driver (vfio_ap.ko) +3. AP mediated matrix passthrough device + +The VFIO AP device driver +------------------------- +The VFIO AP (vfio_ap) device driver serves the following purposes: + +1. Provides the interfaces to reserve APQNs for exclusive use of KVM guests. + +2. Sets up the VFIO mediated device interfaces to manage the mediated matrix + device and create the sysfs interfaces for assigning adapters, usage domains, + and control domains comprising the matrix for a KVM guest. + +3. Configure the APM, AQM and ADM in the CRYCB referenced by a KVM guest's + SIE state description to grant the guest access to AP devices + +4. Initialize the CPU model feature indicating that a KVM guest may use + AP facilities installed on the linux host. + +5. Enable interpretive execution mode for the KVM guest. + +Reserve APQNs for exclusive use of KVM guests +--------------------------------------------- +The following block diagram illustrates the mechanism by which APQNs are +reserved: + + +------------------+ + remove | | unbind + +------------------->+ cex4queue driver +<-----------+ + | | | | + | +------------------+ | + | | + | | + | | ++--------+---------+ register +------------------+ +-----+------+ +| +<---------+ | bind | | +| ap_bus | | vfio_ap driver +<-----+ admin | +| +--------->+ | | | ++------------------+ probe +---+--------+-----+ +------------+ + | | + create | | store APQN + | | + v v + +---+--------+-----+ + | | + | matrix device | + | | + +------------------+ + +The process for reserving an AP queue for use by a KVM guest is: + +* The vfio-ap driver during its initialization will perform the following: + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap + * Create the 'matrix' device in the 'vfio_ap' root + * Register the matrix device with the device core +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and + newer) and to provide the vfio_ap driver's probe and remove callback + interfaces. The reason why older devices are not supported is because there + are no systems available on which to test. +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain + mask to /sys/bus/ap/aqmask. + + For example to reserve all the AP Queues on the system for vfio_ap driver: + + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask + + This would unbind all the AP Queues from the default zcrypt (cex4queue) driver and + make the AP Queues available for vfio_ap driver. + + Another option would be to specify the apmask and aqmask via kernel command line. + eg: + ap.apmask=0x0 ap.aqmask=0x0 + + This would reserve all the AP Queues to be by vfio_ap driver. + +Set up the VFIO mediated device interfaces +------------------------------------------ +The VFIO AP device driver utilizes the common interface of the VFIO mediated +device core driver to: +* Register an AP mediated bus driver to add a mediated matrix device to and + remove it from a VFIO group. +* Create and destroy a mediated matrix device +* Add a mediated matrix device to and remove it from the AP mediated bus driver +* Add a mediated matrix device to and remove it from an IOMMU group + +The following high-level block diagram shows the main components and interfaces +of the VFIO AP mediated matrix device driver: + + +-------------+ + | | + | +---------+ | mdev_register_driver() +--------------+ + | | Mdev | +<-----------------------+ | + | | bus | | | vfio_mdev.ko | + | | driver | +----------------------->+ |<-> VFIO user + | +---------+ | probe()/remove() +--------------+ APIs + | | + | MDEV CORE | + | MODULE | + | mdev.ko | + | +---------+ | mdev_register_device() +--------------+ + | |Physical | +<-----------------------+ | + | | device | | | vfio_ap.ko |<-> matrix + | |interface| +----------------------->+ | device + | +---------+ | callback +--------------+ + +-------------+ + +During initialization of the vfio_ap module, the matrix device is registered +with an 'mdev_parent_ops' structure that provides the sysfs attribute +structures, mdev functions and callback interfaces for managing the mediated +matrix device. + +* sysfs attribute structures: + * supported_type_groups + The VFIO mediated device framework supports creation of user-defined + mediated device types. These mediated device types are specified + via the 'supported_type_groups' structure when a device is registered + with the mediated device framework. The registration process creates the + sysfs structures for each mediated device type specified in the + 'mdev_supported_types' sub-directory of the device being registered. Along + with the device type, the sysfs attributes of the mediated device type are + provided. + + The VFIO AP device driver will register one mediated device type for + passthrough devices: + /sys/devices/vfio_ap/mdev_supported_types/vfio_ap-passthrough + Only the three read-only attributes required by the VFIO mdev framework will + be provided: + /sys/devices/vfio_ap/mdev_supported_types + ... name + ... device_api + ... available_instances + Where: + * name: specifies the name of the mediated device type + * device_api: the mediated device type's API + * available_instances: the number of mediated matrix passthrough devices + that can be created + * mdev_attr_groups + This attribute group identifies the user-defined sysfs attributes of the + mediated device. When a device is registered with the VFIO mediated device + framework, the sysfs attributes files identified in the 'mdev_attr_groups' + structure will be created in the mediated matrix device's directory. The + sysfs attributes for a mediated matrix device are: + * assign_adapter: + A write-only file for assigning an AP adapter to the mediated matrix + device. To assign an adapter, the APID of the adapter is written to the + file. + * assign_domain: + A write-only file for assigning an AP usage domain to the mediated matrix + device. To assign a domain, the APQI of the AP queue corresponding to a + usage domain is written to the file. + * matrix: + A read-only file for displaying the APQNs derived from the adapters and + domains assigned to the mediated matrix device. + * assign_control_domain: + A write-only file for assigning an AP control domain to the mediated + matrix device. To assign a control domain, the ID of a domain to be + controlled is written to the file. For the initial implementation, the set + of control domains will always include the set of usage domains, so it is + only necessary to assign control domains that are not also assigned as + usage domains. + * control_domains: + A read-only file for displaying the control domain numbers assigned to the + mediated matrix device. + +* functions: + * create: + allocates the ap_matrix_mdev structure used by the vfio_ap driver to: + * Keep track of the available instances + * Store the reference to the struct kvm for the KVM guest + * Provide the notifier callback that will get invoked to handle the + VFIO_GROUP_NOTIFY_SET_KVM event. When received, the vfio_ap driver will + store the reference in the mediated matrix device's ap_matrix_mdev + structure and enable the interpretive execution mode for the KVM guest. + * remove: + deallocates the mediated matrix device's ap_matrix_mdev structure. + * activate: + Claims AP Queue resources for a mediated device. + +* callback interfaces + * open: + The vfio_ap driver uses this callback to register a + VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the mdev matrix + device. The notifier is invoked when QEMU connects the VFIO iommu group + for the mdev matrix device to the MDEV bus. Access to the KVM structure used + to configure the KVM guest is provided via this callback. The KVM structure, + is used to configure the guest's access to the AP matrix defined via the + mediated matrix device's sysfs attribute files. + * release: + unregisters the VFIO_GROUP_NOTIFY_SET_KVM notifier callback function for the + mdev matrix device and deconfigures the guest's AP matrix. + +Configure the APM, AQM and ADM in the CRYCB: +------------------------------------------- +Configuring the AP matrix for a KVM guest will be performed when the +VFIO_GROUP_NOTIFY_SET_KVM notifier callback is invoked. The notifier +function is called when QEMU connects the VFIO iommu group for the mdev matrix +device to the MDEV bus. The CRYCB is configured by: +* Setting the bits in the APM corresponding to the APIDs assigned to the + mediated matrix device via its 'assign_adapter' interface. +* Setting the bits in the AQM corresponding to the APQIs assigned to the + mediated matrix device via its 'assign_domain' interface. +* Setting the bits in the ADM corresponding to the domain dIDs assigned to the + mediated matrix device via its 'assign_control_domains' interface. + +Initialize the CPU model feature for AP +--------------------------------------- +A new CPU model feature, KVM_S390_VM_CPU_FEAT_AP, is introduced to indicate that +AP instructions are available to the KVM guest. This feature will be enabled by +KVM only if the AP instructions are installed on the linux host. The feature +must be turned on for the guest in order to access AP devices from the guest. +For example, to turn the AP facilities on from the QEMU command line: + + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on + + Where xxx is the CPU model being used. + + If the CPU model feature is not enabled by the kernel, QEMU will fail and + report that the feature is not supported. For most Linux guests the cpu model + feature apft must also be turned on. + +Example: +======= +Let's now provide an example to illustrate how KVM guests may be given +access to AP facilities. For this example, we will show how to configure +two guests such that executing the lszcrypt command on the guests would +look like this: + +Guest1 +------ +CARD.DOMAIN TYPE MODE +------------------------------ +05 CEX5C CCA-Coproc +05.0004 CEX5C CCA-Coproc +05.00ab CEX5C CCA-Coproc +06 CEX5A Accelerator +06.0004 CEX5A Accelerator +06.00ab CEX5C CCA-Coproc + +Guest2 +------ +CARD.DOMAIN TYPE MODE +------------------------------ +05 CEX5A Accelerator +05.0047 CEX5A Accelerator +05.00ff CEX5A Accelerator + +These are the steps: + +1. Install the vfio_ap module on the linux host. The dependency chain for the + vfio_ap module is: + * vfio + * mdev + * vfio_mdev + * KVM + * vfio_ap + +2. Secure the AP queues to be used by the two guests so that the host can not + access them. Only type 10 adapters (i.e., CEX4 and later) are supported + due to the fact that no test systems with older card types are available + for testing. + + To secure the AP queues each, each AP Queue device must first be unbound from + the cex4queue device driver. The sysfs location of the driver is: + + /sys/bus/ap + --- [drivers] + ------ [cex4queue] + --------- [05.0004] + --------- [05.0047] + --------- [05.00ab] + --------- [05.00ff] + --------- [06.0004] + --------- [06.00ab] + --------- unbind + + To unbind the AP queues from cex4queue driver, we need to update the + apmask and aqmask files. + + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask + + This would unbind all the AP queues and makes them available for vfio_ap + driver. + + Take note that the AP queues bound to the vfio_ap driver will be available + for guest usage until they are unbound from the driver, the vfio_ap module + is unloaded, or the host system is shut down. + +3. Create the mediated devices needed to configure the AP matrixes for the + two guests and to provide an interface to the vfio_ap driver for + use by the guests: + + /sys/devices/virtual/misc + --- [vfio_ap] + --------- [mdev_supported_types] + ------------ [vfio_ap-passthrough] (passthrough mediated matrix device type) + --------------- create + --------------- [devices] + + To create the mediated devices for the two guests: + + uuidgen > create + uuidgen > create + + This will create two mediated devices in the [devices] subdirectory named + with the UUID written to the create attribute file. We call them $uuid1 + and $uuid2: + + /sys/devices/virtual/misc/ + --- [vfio_ap] + --------- [mdev_supported_types] + ------------ [vfio_ap-passthrough] + --------------- [devices] + ------------------ [$uuid1] + --------------------- assign_adapter + --------------------- assign_control_domain + --------------------- assign_domain + --------------------- matrix + --------------------- unassign_adapter + --------------------- unassign_control_domain + --------------------- unassign_domain + + ------------------ [$uuid2] + --------------------- assign_adapter + --------------------- assign_control_domain + --------------------- assign_domain + --------------------- matrix + --------------------- unassign_adapter + --------------------- unassign_control_domain + --------------------- unassign_domain + +4. The administrator now needs to configure the matrixes for mediated + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). + + This is how the matrix is configured for Guest1: + + echo 5 > assign_adapter + echo 6 > assign_adapter + echo 4 > assign_domain + echo 0xab > assign_domain + + For this implementation, all usage domains - i.e., domains assigned + via the assign_domain attribute file - will also be configured in the ADM + field of the KVM guest's CRYCB, so there is no need to assign control + domains here unless you want to assign control domains that are not + assigned as usage domains. + + If a mistake is made configuring an adapter, domain or control domain, + you can use the unassign_xxx files to unassign the adapter, domain or + control domain. + + To display the matrix configuration for Guest1: + + cat matrix + + This is how the matrix is configured for Guest2: + + echo 5 > assign_adapter + echo 0x47 > assign_domain + echo 0xff > assign_domain + +5. The adminstrator now needs to activate the mediated devices. + echo 1 > activate + +6. Start Guest1: + + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... + +7. Start Guest2: + + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... + +When the guest is shut down, the mediated matrix device may be removed. + +Using our example again, to remove the mediated matrix device $uuid1: + + /sys/devices/virtual/misc + --- [vfio_ap] + --------- [mdev_supported_types] + ------------ [vfio_ap-passthrough] + --------------- [devices] + ------------------ [$uuid1] + --------------------- activate + --------------------- remove + + + echo 0 > activate + echo 1 > remove + + This will release all the AP queues for the mediated device and + remove all of the mdev matrix device's sysfs structures. To + recreate and reconfigure the mdev matrix device, all of the steps starting + with step 4 will have to be performed again. + + It is not necessary to remove an mdev matrix device, but one may want to + remove it if no guest will use it during the lifetime of the linux host. If + the mdev matrix device is removed, one may want to unbind the AP queues the + guest was using from the vfio_ap device driver and bind them back to the + default driver. Alternatively, the AP queues can be configured for another + mdev matrix (i.e., guest). In either case, one must take care to change the + secure key configured for the domain to which the queue is connected. + + +Limitations +=========== +An admin needs to be careful when writing to sysfs files +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver +to which an AP queue is bound to. Once an AP queue is bound vfio_ap +driver and assigned to a mediated device, admin should not re-assign the +AP queues back to the default driver, because of technical limitations. +The kernel does not prevent you from re-assigning from AP queues reserved +for the vfio_ap driver back to default driver. Future enhancements in +the ap_bus and vfio_ap are likely to introduce complete support for such +dynamic reconfiguration. But until then be extremely careful. + diff --git a/MAINTAINERS b/MAINTAINERS index 5adff9c424d9..5d5c44e83b38 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12426,6 +12426,7 @@ S: Supported F: drivers/s390/crypto/vfio_ap_drv.c F: drivers/s390/crypto/vfio_ap_private.h F: drivers/s390/crypto/vfio_ap_ops.c +F: Documentation/s390/vfio-ap.txt S390 ZFCP DRIVER M: Steffen Maier <maier@linux.ibm.com>