Message ID | 1363332390-12754-3-git-send-email-shangw@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, 2013-03-15 at 15:26 +0800, Gavin Shan wrote:
> The address (domain/bus/slot/function) of the passed PCI device
> looks quite different from perspective of host and guest. Some
> architectures like PPC need to setup the mapping in host. The patch
> introduces additional VFIO device IOCTL command to address that.

Could you explain further how this will be used?  How the device is
exposed to a guest is entirely a userspace construct, so why does vfio
need to know or care about this?  I had assumed for AER that QEMU would
do the translation from host to guest address space.

> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/vfio.h |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 6e58d9b..ecc4f38 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -289,6 +289,22 @@ struct vfio_irq_set {
>   */
>  #define VFIO_DEVICE_RESET	_IO(VFIO_TYPE, VFIO_BASE + 11)
>
> +/**
> + * VFIO_DEVICE_SET_ADDR_MAPPING - _IO(VFIO_TYPE, VFIO_BASE + 12)
> + *
> + * The address, which comprised of domain/bus/slot/function looks
> + * different between host and guest. We need to setup the mapping
> + * in host for some architectures like PPC so that the passed PCI
> + * devices could support RTAS smoothly.
> + */
> +struct vfio_addr_mapping {
> +	__u64 buid;

What's a buid?  Thanks,

Alex

> +	__u8  bus;
> +	__u8  slot;
> +	__u8  func;
> +};
> +#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)
> +
>  /*
>   * The VFIO-PCI bus driver makes use of the following fixed region and
>   * IRQ index mapping. Unimplemented regions return a size of zero.
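For concreteness, here is a minimal sketch of how a userspace driver such as QEMU might invoke the proposed ioctl after opening a VFIO device. The `vfio_addr_mapping` struct and `VFIO_DEVICE_SET_ADDR_MAPPING` exist only in this patch series (not in mainline `vfio.h`), and the BUID and bus/slot/function values below are invented purely for illustration.

```c
/*
 * Hypothetical usage sketch of the proposed VFIO_DEVICE_SET_ADDR_MAPPING
 * ioctl.  Assumes a kernel tree with this series applied; all guest-side
 * address values are made-up examples.
 */
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int set_guest_addr_mapping(int device_fd)
{
	struct vfio_addr_mapping map = {
		.buid = 0x800000020000001ULL, /* guest PHB "BUID" (assumed value) */
		.bus  = 0x01,                 /* guest-visible bus number */
		.slot = 0x00,                 /* guest-visible slot (device) number */
		.func = 0x0,                  /* guest-visible function number */
	};

	/* Tell the host how the passed-through device appears to the guest. */
	return ioctl(device_fd, VFIO_DEVICE_SET_ADDR_MAPPING, &map);
}
```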
On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> >Could you explain further how this will be used?  How the device is
> >exposed to a guest is entirely a userspace construct, so why does vfio
> >need to know or care about this?  I had assumed for AER that QEMU would
> >do the translation from host to guest address space.
> >
>
> The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> patch. The PowerNV platform is going to override it to figure out the
> information for EEH core to use. On the other hand, QEMU will runs into
> the IOCTL command while opening (creating) one VFIO device.
>
> Though I'm not familiar with AER very much. AER is quite different from
> EEH. The EEH functionality implemented in PHB instead of in PCI device
> core. So we don't care AER stuff in EEH directly :-)

To give Alex a bit more background...

EEH is our IBM specific error handling facility which is a superset of AER.

IE. In addition to AER's error detection and logging, it adds a layer of
error detection at the host bridge level (such as iommu violations etc...)
and a mechanism for handling and recovering from errors. This is tied to
our iommu domain stuff (our PE's) and our device "freezing" capability
among others.

With VFIO + KVM, we want to implement most of the EEH support for guests in
the host kernel. The reason is multipart and we can discuss this separately
as some of it might well be debatable (mostly it's more convenient that way
because we hook into the underlying HW/FW EEH which isn't directly userspace
accessible so we don't have to add a new layer of kernel -> user API in
addition to the VFIO stuff), but there's at least one aspect of it that drives
this requirement more strongly which is performance:

When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
a firmware call to query the EEH state of the device and check whether it
has been frozen. On some devices, that can be a performance issue, and
going all the way to qemu for that would be horribly expensive.

So we want at least a way to handle that call in the kernel and for that we
need at least some way of mapping things there.

Cheers,
Ben.
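To illustrate the performance-sensitive path Ben describes, the sketch below shows the detection pattern only: an MMIO read returning all 1's may mean the PE has been isolated ("frozen"), so the host asks firmware for the EEH state before believing the value. This is not the actual powerpc implementation; every name here is an invented stand-in for the real EEH and firmware interfaces.

```c
/*
 * Illustrative sketch of the EEH "all 1's" check, with invented stubs
 * standing in for the real firmware/EEH interfaces.
 */
#include <stdint.h>

enum pe_state { PE_STATE_OK, PE_STATE_FROZEN };

enum pe_state firmware_query_pe_state(unsigned int pe_number); /* stub */
void eeh_handle_frozen_pe(unsigned int pe_number);             /* stub */

static uint32_t mmio_read32_checked(volatile uint32_t *reg, unsigned int pe_number)
{
	uint32_t val = *reg;

	if (val == 0xffffffff) {
		/*
		 * All 1's can be a legitimate register value, so a firmware
		 * call is needed to disambiguate.  Bouncing every such read
		 * out to QEMU would be far too slow, which is the argument
		 * for handling the check, and hence the address mapping, in
		 * the host kernel.
		 */
		if (firmware_query_pe_state(pe_number) == PE_STATE_FROZEN)
			eeh_handle_frozen_pe(pe_number);
	}
	return val;
}
```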
On Sat, 2013-03-16 at 06:37 +0100, Benjamin Herrenschmidt wrote:
> On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> > >Could you explain further how this will be used?  How the device is
> > >exposed to a guest is entirely a userspace construct, so why does vfio
> > >need to know or care about this?  I had assumed for AER that QEMU would
> > >do the translation from host to guest address space.
> > >
> >
> > The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> > patch. The PowerNV platform is going to override it to figure out the
> > information for EEH core to use. On the other hand, QEMU will runs into
> > the IOCTL command while opening (creating) one VFIO device.
> >
> > Though I'm not familiar with AER very much. AER is quite different from
> > EEH. The EEH functionality implemented in PHB instead of in PCI device
> > core. So we don't care AER stuff in EEH directly :-)
>
> To give Alex a bit more background...
>
> EEH is our IBM specific error handling facility which is a superset of AER.
>
> IE. In addition to AER's error detection and logging, it adds a layer of
> error detection at the host bridge level (such as iommu violations etc...)
> and a mechanism for handling and recovering from errors. This is tied to
> our iommu domain stuff (our PE's) and our device "freezing" capability
> among others.
>
> With VFIO + KVM, we want to implement most of the EEH support for guests in
> the host kernel. The reason is multipart and we can discuss this separately
> as some of it might well be debatable (mostly it's more convenient that way
> because we hook into the underlying HW/FW EEH which isn't directly userspace
> accessible so we don't have to add a new layer of kernel -> user API in
> addition to the VFIO stuff), but there's at least one aspect of it that drives
> this requirement more strongly which is performance:
>
> When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
> a firmware call to query the EEH state of the device and check whether it
> has been frozen. On some devices, that can be a performance issue, and
> going all the way to qemu for that would be horribly expensive.
>
> So we want at least a way to handle that call in the kernel and for that we
> need at least some way of mapping things there.

There's no notification mechanism when a PHB is frozen?  I suppose
notification would be asynchronous so you risk data for every read that
happens in the interim.  So the choices are a) tell the host kernel the
mapping, b) tell the guest kernel the mapping, c) identity mapping, or
d) qemu intercept?

Presumably your firmware call to query the EEH is not going through
VFIO, so is VFIO the appropriate place to setup this mapping?  As you
say, this seems like just a convenient place to put it even though it
really has nothing to do with the VFIO kernel component.  QEMU has this
information and could register it with the host kernel through other
means if available.  Maybe the mapping should be registered with KVM if
that's how the EEH data is accessed.  I'm not yet sold on why this
mapping is registered here.  Thanks,

Alex
On Tue, 2013-03-19 at 11:24 +0800, Gavin Shan wrote:
> On Mon, Mar 18, 2013 at 03:01:14PM -0600, Alex Williamson wrote:
> >On Sat, 2013-03-16 at 06:37 +0100, Benjamin Herrenschmidt wrote:
> >> On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> >> > >Could you explain further how this will be used?  How the device is
> >> > >exposed to a guest is entirely a userspace construct, so why does vfio
> >> > >need to know or care about this?  I had assumed for AER that QEMU would
> >> > >do the translation from host to guest address space.
> >> > >
> >> >
> >> > The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> >> > patch. The PowerNV platform is going to override it to figure out the
> >> > information for EEH core to use. On the other hand, QEMU will runs into
> >> > the IOCTL command while opening (creating) one VFIO device.
> >> >
> >> > Though I'm not familiar with AER very much. AER is quite different from
> >> > EEH. The EEH functionality implemented in PHB instead of in PCI device
> >> > core. So we don't care AER stuff in EEH directly :-)
> >>
> >> To give Alex a bit more background...
> >>
> >> EEH is our IBM specific error handling facility which is a superset of AER.
> >>
> >> IE. In addition to AER's error detection and logging, it adds a layer of
> >> error detection at the host bridge level (such as iommu violations etc...)
> >> and a mechanism for handling and recovering from errors. This is tied to
> >> our iommu domain stuff (our PE's) and our device "freezing" capability
> >> among others.
> >>
> >> With VFIO + KVM, we want to implement most of the EEH support for guests in
> >> the host kernel. The reason is multipart and we can discuss this separately
> >> as some of it might well be debatable (mostly it's more convenient that way
> >> because we hook into the underlying HW/FW EEH which isn't directly userspace
> >> accessible so we don't have to add a new layer of kernel -> user API in
> >> addition to the VFIO stuff), but there's at least one aspect of it that drives
> >> this requirement more strongly which is performance:
> >>
> >> When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
> >> a firmware call to query the EEH state of the device and check whether it
> >> has been frozen. On some devices, that can be a performance issue, and
> >> going all the way to qemu for that would be horribly expensive.
> >>
> >> So we want at least a way to handle that call in the kernel and for that we
> >> need at least some way of mapping things there.
> >
> >There's no notification mechanism when a PHB is frozen?  I suppose
> >notification would be asynchronous so you risk data for every read that
> >happens in the interim.  So the choices are a) tell the host kernel the
> >mapping, b) tell the guest kernel the mapping, c) identity mapping, or
> >d) qemu intercept?
> >
>
> We do have dedicated interrupts on detecting frozen PHB on host side.
> However, the guest has to poll/check the frozen state (frozen PE) during
> access to config or MMIO space.

Can you make use of something like this to notify the guest:

https://github.com/awilliam/linux-vfio/commit/dad9f8972e04cd081a028d8fb1249d746d97fc03

As a first step this only notifies QEMU, but the plan is to forward that
on to the guest.  If we can leverage similar interfaces between AER and
EEH, I'd obviously like to do that.

> For the recommended methods, (a) is what
> we want to do with the patchset. (b) seems infeasible since the guest
> shouldn't be aware of hypervisor (e.g. KVM or PowerVM) it's running on
> top of, it's hard to polish the guest to do it. (d) sounds applicable
> since the QEMU should know the address (BDF) of host and guest devices.
> However, we still need let the host EEH core know that which PCI device
> has been passed to guest and the best place to do that would be when opening
> the corresponding VFIO PCI device. In turn, it will still need weak function
> for ppc platform to override it. Why we not directly take (a) to finish
> everything in one VFIO IOCTL command?

Because it seems like VFIO is just being used as a relay and has no
purpose knowing this information on its own.  It's just a convenient
place to host the ioctl, but that alone is not a good enough reason to
put it there.

> Sorry, Alex. I didn't understand (c) well :-)

(c) is if the buid/bus/slot/func exposed to the guest matches the same
for the host, then there's no need for mapping translation.

> >Presumably your firmware call to query the EEH is not going through
> >VFIO, so is VFIO the appropriate place to setup this mapping?  As you
> >say, this seems like just a convenient place to put it even though it
> >really has nothing to do with the VFIO kernel component.  QEMU has this
> >information and could register it with the host kernel through other
> >means if available.  Maybe the mapping should be registered with KVM if
> >that's how the EEH data is accessed.  I'm not yet sold on why this
> >mapping is registered here.  Thanks,
>
> Yes, EEH firmware call needn't going through VFIO. However, EEH has
> very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> has close relationship with VFIO-PCI :-)

Is there some plan to do more with EEH through VFIO in the future or are
we just talking about some kind of weird associative property to sell
this ioctl?  Thanks,

Alex
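The notification mechanism Alex links above hands VFIO an eventfd that the host signals when it detects an error on the device; userspace (QEMU) polls the eventfd and can then forward the event to the guest. A sketch of that registration using the standard `vfio_irq_set` interface follows; the error IRQ index belongs to the (then in-progress) error-interrupt series rather than to this patch, so treat its availability as an assumption.

```c
/*
 * Sketch: register an eventfd to be signalled on device error, via
 * VFIO_DEVICE_SET_IRQS.  Error handling is omitted for brevity.
 */
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int register_err_eventfd(int device_fd)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	size_t argsz = sizeof(struct vfio_irq_set) + sizeof(int);
	struct vfio_irq_set *irq_set = calloc(1, argsz);

	irq_set->argsz = argsz;
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_ERR_IRQ_INDEX;  /* from the linked series */
	irq_set->start = 0;
	irq_set->count = 1;
	memcpy(irq_set->data, &efd, sizeof(int));

	ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
	free(irq_set);
	return efd;  /* poll()/read() this fd to learn of device errors */
}
```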
On Mon, 2013-03-18 at 22:18 -0600, Alex Williamson wrote:
> > Yes, EEH firmware call needn't going through VFIO. However, EEH has
> > very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> > has close relationship with VFIO-PCI :-)
>
> Is there some plan to do more with EEH through VFIO in the future or are
> we just talking about some kind of weird associative property to sell
> this ioctl?  Thanks,

Well, I'm not sure how 'weird' that is but it makes sense to me... VFIO
is the mechanism that virtualizes access to a PCI device and provides
interfaces to qemu & kvm to access it &| map it.

Or rather VFIO-PCI is.

At a basic level it provides ... the basic PCI interfaces, ie, config
space access (with or without filtering), interrupts, etc...

In our environment, EEH is just another functionality of PCI really.
The firmware calls used by the guest to do that fall into more or less
the same category as the ones used for PCI config space accesses,
manipulation of DMA windows, etc... Similar to host (though guest
and host use a different FW interface for various reasons).

So it's very natural to "transport" these via VFIO-PCI like everything
else, I don't see a more natural place to put the ioctl's we need for
qemu to be able to access the EEH state, trigger EEH (un)isolation,
resets, etc...

Fundamentally, the design should be for VFIO-PCI to provide some specific
ioctls for EEH that userspace programs such as qemu can use, and then
re-expose those APIs to the guest.

In addition, to do some of it in the kernel for performance reason, we
want to establish that mapping, but I see that as a VFIO "accelerator".

IE. Whatever is going to respond to the EEH calls from the guest in-kernel
will have to share state with the rest of the EEH stuff provided to qemu
by vfio-pci.

Ben.
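To make the "VFIO-PCI provides EEH ioctls" design Ben sketches a little more concrete, here is a hypothetical uapi fragment in the style of the patch under review. None of these names exist in `vfio.h`; it only illustrates the shape of an interface QEMU could call and re-expose to the guest via RTAS.

```c
/*
 * Hypothetical illustration only: a vfio-pci ioctl through which QEMU
 * could query and manipulate the EEH state of a passed-through device.
 * Names, values and ioctl number are invented for this sketch.
 */
struct vfio_eeh_op {
	__u32 argsz;
	__u32 flags;
#define VFIO_EEH_OP_GET_STATE	0	/* is the PE frozen?             */
#define VFIO_EEH_OP_UNFREEZE	1	/* re-enable MMIO/DMA after fix  */
#define VFIO_EEH_OP_RESET_HOT	2	/* hot reset the PE              */
	__u32 op;
	__u32 result;			/* state returned for GET_STATE  */
};
#define VFIO_DEVICE_EEH_OP	_IO(VFIO_TYPE, VFIO_BASE + 13)
```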
On Tue, 2013-03-19 at 05:45 +0100, Benjamin Herrenschmidt wrote:
> On Mon, 2013-03-18 at 22:18 -0600, Alex Williamson wrote:
> > > Yes, EEH firmware call needn't going through VFIO. However, EEH has
> > > very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> > > has close relationship with VFIO-PCI :-)
> >
> > Is there some plan to do more with EEH through VFIO in the future or are
> > we just talking about some kind of weird associative property to sell
> > this ioctl?  Thanks,
>
> Well, I'm not sure how 'weird' that is but it makes sense to me... VFIO
> is the mechanism that virtualizes access to a PCI device and provides
> interfaces to qemu & kvm to access it &| map it.
>
> Or rather VFIO-PCI is.
>
> At a basic level it provides ... the basic PCI interfaces, ie, config
> space access (with or without filtering), interrupts, etc...
>
> In our environment, EEH is just another functionality of PCI really.
> The firmware calls used by the guest to do that fall into more or less
> the same category as the ones used for PCI config space accesses,
> manipulation of DMA windows, etc... Similar to host (though guest
> and host use a different FW interface for various reasons).
>
> So it's very natural to "transport" these via VFIO-PCI like everything
> else, I don't see a more natural place to put the ioctl's we need for
> qemu to be able to access the EEH state, trigger EEH (un)isolation,
> resets, etc...
>
> Fundamentally, the design should be for VFIO-PCI to provide some specific
> ioctls for EEH that userspace programs such as qemu can use, and then
> re-expose those APIs to the guest.
>
> In addition, to do some of it in the kernel for performance reason, we
> want to establish that mapping, but I see that as a VFIO "accelerator".
>
> IE. Whatever is going to respond to the EEH calls from the guest in-kernel
> will have to share state with the rest of the EEH stuff provided to qemu
> by vfio-pci.

Perhaps my problem is that I don't have a clear picture of where you're
going with this like I do for AER.  For AER we're starting with
notification of an error, from that we build into how to retrieve the
error information, and finally how to perform corrective action.  Each
of these will be done through vfio-pci.

Here we're starting by registering a mapping that's really only useful
to the vfio "accelerator" path, but we don't even have a hint of what
the non-accelerated path is and how vfio is involved with it.  Thanks,

Alex
On Wed, 2013-03-20 at 12:48 -0600, Alex Williamson wrote:
> Perhaps my problem is that I don't have a clear picture of where you're
> going with this like I do for AER.  For AER we're starting with
> notification of an error, from that we build into how to retrieve the
> error information, and finally how to perform corrective action.  Each
> of these will be done through vfio-pci.
>
> Here we're starting by registering a mapping that's really only useful
> to the vfio "accelerator" path, but we don't even have a hint of what
> the non-accelerated path is and how vfio is involved with it.  Thanks,

I'm surprised that you are building so much policy around AER ... can't
you just pass the raw stuff down to the guest and let the guest do its
own corrective actions ?

As for EEH, I will let Gavin describe in more details what he is doing,
though I wouldn't be surprised if so far he doesn't have a
non-accelerated path :-) Which indeed makes things oddball, granted ...
at least for now. I *think* what Gavin's doing right now is a
pass-through to the host EEH directly in the kernel, so without a slow
path...

Gavin, it really boils down to that. In-kernel EEH for guests is a
KVMism that ends up not involving VFIO in any other way than
establishing the mapping, then arguably it could be done via a VM ioctl.

If there's more going through VFIO and shared state, then it should
probably go through VFIO-PCI.

Note (FYI) that EEH somewhat encompass AER... the EEH logic triggers on
AER errors as well and the error reports generated by the firmware
contain the AER register dump in addition to the bridge internal stuff.

IE. EEH encompass pretty much all sort of errors, correctable or not,
that happens on PCI. It adds a mechanism of "isolation" of domains on
first error (involving blocking MMIOs, DMAs and MSIs) which helps with
preventing propagation of bad data, and various recovery schemes.

Cheers,
Ben.
On Wed, 2013-03-20 at 20:31 +0100, Benjamin Herrenschmidt wrote:
> On Wed, 2013-03-20 at 12:48 -0600, Alex Williamson wrote:
> > Perhaps my problem is that I don't have a clear picture of where you're
> > going with this like I do for AER.  For AER we're starting with
> > notification of an error, from that we build into how to retrieve the
> > error information, and finally how to perform corrective action.  Each
> > of these will be done through vfio-pci.
> >
> > Here we're starting by registering a mapping that's really only useful
> > to the vfio "accelerator" path, but we don't even have a hint of what
> > the non-accelerated path is and how vfio is involved with it.  Thanks,
>
> I'm surprised that you are building so much policy around AER ... can't
> you just pass the raw stuff down to the guest and let the guest do its
> own corrective actions ?

How does the guest get the raw stuff?  We need to get the AER interrupt
out to the guest so it can be injected into the virtual PCIe port, then
we need to be able to retrieve the physical device log and pass it to
the qemu to mangle to match the guest topology.  We don't have existing
firmware interfaces for the guest to do that, so it's all being routed
through vfio-pci.

> As for EEH, I will let Gavin describe in more details what he is doing,
> though I wouldn't be surprised if so far he doesn't have a
> non-accelerated path :-) Which indeed makes things oddball, granted ...
> at least for now. I *think* what Gavin's doing right now is a
> pass-through to the host EEH directly in the kernel, so without a slow
> path...
>
> Gavin, it really boils down to that. In-kernel EEH for guests is a
> KVMism that ends up not involving VFIO in any other way than
> establishing the mapping, then arguably it could be done via a VM ioctl.
>
> If there's more going through VFIO and shared state, then it should
> probably go through VFIO-PCI.

Exactly my thinking.  Thanks,

Alex
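The "retrieve the physical device log" step Alex mentions can already be done with existing VFIO interfaces, since the device's config space (including the AER extended capability) is exposed as a region on the device fd. The sketch below assumes the AER capability offset (`aer_cap`) has already been located by walking the extended capability list, which is omitted here.

```c
/*
 * Sketch: read the AER uncorrectable status and TLP header log of a
 * vfio-pci device through its config space region.  Error handling and
 * the extended-capability walk are omitted; aer_cap is assumed known.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/pci_regs.h>
#include <linux/vfio.h>

static void read_aer_log(int device_fd, uint16_t aer_cap)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = VFIO_PCI_CONFIG_REGION_INDEX,
	};
	uint32_t uncor_status, header_log[4];

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* Uncorrectable error status and the header log of the faulting TLP. */
	pread(device_fd, &uncor_status, sizeof(uncor_status),
	      info.offset + aer_cap + PCI_ERR_UNCOR_STATUS);
	pread(device_fd, header_log, sizeof(header_log),
	      info.offset + aer_cap + PCI_ERR_HEADER_LOG);

	/* QEMU would now translate this log to match the guest topology. */
}
```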
```diff
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 6e58d9b..ecc4f38 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -289,6 +289,22 @@ struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET	_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/**
+ * VFIO_DEVICE_SET_ADDR_MAPPING - _IO(VFIO_TYPE, VFIO_BASE + 12)
+ *
+ * The address, which comprised of domain/bus/slot/function looks
+ * different between host and guest. We need to setup the mapping
+ * in host for some architectures like PPC so that the passed PCI
+ * devices could support RTAS smoothly.
+ */
+struct vfio_addr_mapping {
+	__u64 buid;
+	__u8  bus;
+	__u8  slot;
+	__u8  func;
+};
+#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
 /*
  * The VFIO-PCI bus driver makes use of the following fixed region and
  * IRQ index mapping. Unimplemented regions return a size of zero.
```
The address (domain/bus/slot/function) of the passed PCI device looks
quite different from perspective of host and guest. Some architectures
like PPC need to setup the mapping in host. The patch introduces
additional VFIO device IOCTL command to address that.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 include/uapi/linux/vfio.h |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)