Message ID | 20210702095849.1610-1-shameerali.kolothum.thodi@huawei.com (mailing list archive) |
---|---|
Headers | show |
Series | vfio/hisilicon: add acc live migration driver | expand |
On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote: > This series attempts to add vfio live migration support for > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space > includes both the functional register space and migration > control register space. As discussed in RFCv1[0], this may create > security issues as these regions get shared between the Guest > driver and the migration driver. Based on the feedback, we tried > to address those concerns in this version. > > This is now based on the new vfio-pci-core framework proposal[1]. > Understand that the framework proposal is still under discussion, > but really appreciate any feedback on the approach taken here > to mitigate the security risks. Hi, can you look at the v6 proposal for the mlx5 implementation of the migration API and see if it meets hisilicon acc's needs as well? https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/ There are few topics to consider: - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make sense for this driver? I see pf_qm_state_pre_save() but didn't understand why it wanted to send the first 32 bytes in the PRECOPY mode? It is fine, but it will add some complexity to continue to do this. - I think we discussed the P2P implementation and decided it would work for this device? Can you re-read and confirm? - Are the arcs we defined going to work here as well? The current implementation in hisi_acc_vf_set_device_state() is very far away from what the v1 protocol is, so I'm having a hard time guessing, but.. RESUMING -> STOP Probably vf_qm_state_resume() RUNNING -> STOP vf_qm_fun_restart() - that is oddly named.. STOP -> RESUMING Seems to be a nop (likely a bug) STOP -> RUNNING Not implemented currenty? (also a bug) STOP -> STOP_COPY pf_qm_state_pre_save / vf_qm_state_save STOP_COPY -> STOP NOP And the modification for the P2P/NO DMA is presumably just fun_restart too since stopping the device and stopping DMA are going to be the same thing here? The mlx5 implementation linked above is a full example you can cut and paste from for how to implement the state function and the how to do the data transfer. The f_ops read/write implementation for acc looks trivial as it only streams the fixed size and pre-allocated 'struct acc_vf_data' It looks like it would be a short path to implement our v2 proposal and remove a lot of driver code, as we saw in mlx5. Thanks, Jason
> -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 02 February 2022 13:15 > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-crypto@vger.kernel.org; alex.williamson@redhat.com; > mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang > <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; > yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com> > Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver > > On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote: > > This series attempts to add vfio live migration support for > > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space > > includes both the functional register space and migration > > control register space. As discussed in RFCv1[0], this may create > > security issues as these regions get shared between the Guest > > driver and the migration driver. Based on the feedback, we tried > > to address those concerns in this version. > > > > This is now based on the new vfio-pci-core framework proposal[1]. > > Understand that the framework proposal is still under discussion, > > but really appreciate any feedback on the approach taken here > > to mitigate the security risks. > > Hi, can you look at the v6 proposal for the mlx5 implementation of the > migration API and see if it meets hisilicon acc's needs as well? > > https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/ Yes, I saw that one. Thanks for that and is now looking into it. > > There are few topics to consider: > - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make > sense for this driver? I think it will be STOP_COPY only for now. We might have PRECOPY feature once we have the SMMUv3 HTTU support in future. > > I see pf_qm_state_pre_save() but didn't understand why it wanted to > send the first 32 bytes in the PRECOPY mode? It is fine, but it > will add some complexity to continue to do this. That was mainly to do a quick verification between src and dst compatibility before we start saving the state. I think probably we can delay that check for later. > - I think we discussed the P2P implementation and decided it would > work for this device? Can you re-read and confirm? In our case these devices are Integrated End Point devices and doesn't have P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature I guess. Also, since we cannot guarantee a NDMA state in STOP, my assumption currently is the onus of making sure that no MMIO access happens in STOP is on the user. Is that a valid assumption? > - Are the arcs we defined going to work here as well? The current > implementation in hisi_acc_vf_set_device_state() is very far away > from what the v1 protocol is, so I'm having a hard time guessing, > but.. Right. The FSM has changed a couple of times since we posted this. I am going to rebase all that now. > RESUMING -> STOP > Probably vf_qm_state_resume() > > RUNNING -> STOP > vf_qm_fun_restart() - that is oddly named.. > > STOP -> RESUMING > Seems to be a nop (likely a bug) > > STOP -> RUNNING > Not implemented currenty? (also a bug) > > STOP -> STOP_COPY > pf_qm_state_pre_save / vf_qm_state_save > > STOP_COPY -> STOP > NOP I will check and verify this. > And the modification for the P2P/NO DMA is presumably just > fun_restart too since stopping the device and stopping DMA are > going to be the same thing here? Yes, in our case stopping device and stopping DMA are effectively the same thing. > > The mlx5 implementation linked above is a full example you can cut and > paste from for how to implement the state function and the how to do > the data transfer. The f_ops read/write implementation for acc looks > trivial as it only streams the fixed size and pre-allocated 'struct > acc_vf_data' > > It looks like it would be a short path to implement our v2 proposal > and remove a lot of driver code, as we saw in mlx5. > Ok. These are the git repo I am using for the rework, https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2 https://github.com/jgunthorpe/linux/tree/vfio_migration_v2 Please let me know if the above are not up to date. Also, just noted that my quick prototype is now failing with below error, " Error: VFIO device doesn't support migration" Do we need to set the below before the feature query? Or am I using a wrong Qemu/kernel repo? --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice *vbasedev, uint64_t *mig_flags) struct vfio_device_feature_migration *mig = (void *)feature->data; feature->argsz = sizeof(buf); + feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET; if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0) return -EOPNOTSUPP; Thanks, Shameer
On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi wrote: > > There are few topics to consider: > > - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make > > sense for this driver? > > I think it will be STOP_COPY only for now. We might have PRECOPY feature once > we have the SMMUv3 HTTU support in future. HTTU is the dirty tracking feature? To be clear VFIO migration support for PRECOPY has nothing to do with IOMMU based dirty page tracking. > > - I think we discussed the P2P implementation and decided it would > > work for this device? Can you re-read and confirm? > > In our case these devices are Integrated End Point devices and doesn't have > P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature > I guess. Also, since we cannot guarantee a NDMA state in STOP, my > assumption currently is the onus of making sure that no MMIO access happens > in STOP is on the user. Is that a valid assumption? Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO access to sustain it. (and I'm wondering sometimes if we should rename RUNNING_P2P to STOP_P2P - ie the device is stopped but still allows inbound P2P to make this clearer) > Do we need to set the below before the feature query? > Or am I using a wrong Qemu/kernel repo? > > +++ b/hw/vfio/migration.c > @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice > *vbasedev, uint64_t *mig_flags) > struct vfio_device_feature_migration *mig = (void *)feature->data; > > feature->argsz = sizeof(buf); > + feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET; > if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0) > return -EOPNOTSUPP; Oh, this is my mistake I thought this got pushed to that github already but didn't, I updated it. If you have a prototype can you post another RFC? Thanks, Jason
> -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 02 February 2022 15:40 > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; > linux-crypto@vger.kernel.org; alex.williamson@redhat.com; > mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang > <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; > yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com> > Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver > > On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi > wrote: > > > > There are few topics to consider: > > > - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make > > > sense for this driver? > > > > I think it will be STOP_COPY only for now. We might have PRECOPY > > feature once we have the SMMUv3 HTTU support in future. > > HTTU is the dirty tracking feature? To be clear VFIO migration support for > PRECOPY has nothing to do with IOMMU based dirty page tracking. Yes, it is based on the IOMMU hardware dirty bit management support. A RFC was posted sometime back, https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/ Ok, my guess was that the PRECOPY here was related. Thanks for clarifying. > > > > - I think we discussed the P2P implementation and decided it would > > > work for this device? Can you re-read and confirm? > > > > In our case these devices are Integrated End Point devices and doesn't > > have P2P DMA capability. Hence the FSM arcs will be limited to > > STOP_COPY feature I guess. Also, since we cannot guarantee a NDMA > > state in STOP, my assumption currently is the onus of making sure that > > no MMIO access happens in STOP is on the user. Is that a valid assumption? > > Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO > access to sustain it. Ok. > (and I'm wondering sometimes if we should rename RUNNING_P2P to > STOP_P2P - ie the device is stopped but still allows inbound P2P to make this > clearer) > > > Do we need to set the below before the feature query? > > Or am I using a wrong Qemu/kernel repo? > > > > +++ b/hw/vfio/migration.c > > @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice > > *vbasedev, uint64_t *mig_flags) > > struct vfio_device_feature_migration *mig = (void > > *)feature->data; > > > > feature->argsz = sizeof(buf); > > + feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | > > + VFIO_DEVICE_FEATURE_GET; > > if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0) > > return -EOPNOTSUPP; > > Oh, this is my mistake I thought this got pushed to that github already but > didn't, I updated it. Ok. Thanks. > If you have a prototype can you post another RFC? Sure, will do. I just started and has only a skeleton proto based on v2 now. Will send out a RFC soon once I have all the FSM arcs implemented and sanity tested. Thanks, Shameer
On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote: > > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > Sent: 02 February 2022 15:40 > > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; > > linux-crypto@vger.kernel.org; alex.williamson@redhat.com; > > mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang > > <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; > > yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron > > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com> > > Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver > > > > On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi > > wrote: > > > > > > There are few topics to consider: > > > > - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make > > > > sense for this driver? > > > > > > I think it will be STOP_COPY only for now. We might have PRECOPY > > > feature once we have the SMMUv3 HTTU support in future. > > > > HTTU is the dirty tracking feature? To be clear VFIO migration support for > > PRECOPY has nothing to do with IOMMU based dirty page tracking. > > Yes, it is based on the IOMMU hardware dirty bit management support. > A RFC was posted sometime back, > https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/ Yes, I saw that. I was hoping to have a discussion on this soon about how to integrate that with the iommufd work, which I hope will allow that series, and the other IOMMU drivers that can support this to be merged.. Jason
On Wed, 2 Feb 2022 14:34:52 +0000 Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > > -----Original Message----- > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > > > > I see pf_qm_state_pre_save() but didn't understand why it wanted to > > send the first 32 bytes in the PRECOPY mode? It is fine, but it > > will add some complexity to continue to do this. > > That was mainly to do a quick verification between src and dst compatibility > before we start saving the state. I think probably we can delay that check > for later. In the v1 migration scheme, this was considered good practice. It shouldn't be limited to PRECOPY, as there's no requirement to use PRECOPY, but the earlier in the migration process that we can trigger a device or data stream compatibility fault, the better. TBH, even in the case where a device doesn't support live dirty tracking for a PRECOPY phase, using it for compatibility testing continues to seem like good practice. Thanks, Alex
On Wed, Feb 02, 2022 at 10:30:41AM -0700, Alex Williamson wrote: > On Wed, 2 Feb 2022 14:34:52 +0000 > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > > > > > > > I see pf_qm_state_pre_save() but didn't understand why it wanted to > > > send the first 32 bytes in the PRECOPY mode? It is fine, but it > > > will add some complexity to continue to do this. > > > > That was mainly to do a quick verification between src and dst compatibility > > before we start saving the state. I think probably we can delay that check > > for later. > > In the v1 migration scheme, this was considered good practice. It > shouldn't be limited to PRECOPY, as there's no requirement to use > PRECOPY, but the earlier in the migration process that we can trigger a > device or data stream compatibility fault, the better. TBH, even in > the case where a device doesn't support live dirty tracking for a > PRECOPY phase, using it for compatibility testing continues to seem > like good practice. At least with our thinking here, we'd rather the device expose an explicit compatibility data via get/test system calls so we can build proper infrastructure around this. Every device will have compatibility requirements and we can build more shared common code this way. ie qemu can ideally fetch the data before migration starts and do an exchange with the live migration target to see if it is OK. Orchestration can inventory the systems, and automation can select live migration targets that can actually work. If it is hidden inside the migration stream it is too invisible to be fully useful. This is something we've been talking about here but don't have much concrete to say for mlx5 yet. The device still has to self-protect itself against a corrupted migration stream impacting integrity, of course. IIRC qemu has a nice spot to put this in the existing protocol. Just overall, now that PRECOPY is optional, we should avoid using it without a good reason. The driver implementation does have a cost. Thanks, Jason
On 2/2/22 17:03, Jason Gunthorpe wrote: > On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote: >>> From: Jason Gunthorpe [mailto:jgg@nvidia.com] >>> Sent: 02 February 2022 15:40 >>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> >>> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org; >>> linux-crypto@vger.kernel.org; alex.williamson@redhat.com; >>> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang >>> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; >>> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron >>> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com> >>> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver >>> >>> On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi >>> wrote: >>> >>>>> There are few topics to consider: >>>>> - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make >>>>> sense for this driver? >>>> >>>> I think it will be STOP_COPY only for now. We might have PRECOPY >>>> feature once we have the SMMUv3 HTTU support in future. >>> >>> HTTU is the dirty tracking feature? To be clear VFIO migration support for >>> PRECOPY has nothing to do with IOMMU based dirty page tracking. >> >> Yes, it is based on the IOMMU hardware dirty bit management support. >> A RFC was posted sometime back, >> https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/ > > Yes, I saw that. I was hoping to have a discussion on this soon about Sorry to snatch the thread, I was thinking in doing the same this week, as I was playing around that area, and wanted to ask the community... but since you mentioned it :D > how to integrate that with the iommufd work, which I hope will allow > that series, and the other IOMMU drivers that can support this to be > merged.. The iommu-fd thread wasn't particularly obvious on how dirty tracking is done there, but TBH I am not up to speed on iommu-fd yet so I missed something obvious for sure. When you say 'integrate that with the iommufd' can you expand on that? Did you meant to use interface in the link, or perhaps VFIO would use an iommufd /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the VMM to scan iommu pagetables itself and see what was marked dirty or not? My over-simplistic/naive view was that the proposal in the link above sounded a lot simpler. While iommu-fd had more longevity for many other usecases outside dirty tracking, no? I have a PoC-ish using the interface in the link, with AMD IOMMU dirty bit supported (including Qemu emulated amd-iommu for folks lacking the hardware). Albeit the eager-spliting + collapsing of IOMMU hugepages is not yet done there, and I wanted to play around the emulated intel-iommu SLADS from specs looks quite similar. Happy to join existing effort anyways.
On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote: > On 2/2/22 17:03, Jason Gunthorpe wrote: > > how to integrate that with the iommufd work, which I hope will allow > > that series, and the other IOMMU drivers that can support this to be > > merged.. > > The iommu-fd thread wasn't particularly obvious on how dirty tracking is done > there, but TBH I am not up to speed on iommu-fd yet so I missed something > obvious for sure. When you say 'integrate that with the iommufd' can you > expand on that? The general idea is that iommufd is the place to put all the iommu driver uAPI for consumption by userspace. The IOMMU feature of dirty tracking would belong there. So, some kind of API needs to be designed to meet the needs of the IOMMU drivers. > Did you meant to use interface in the link, or perhaps VFIO would use an iommufd > /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's > not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the > VMM to scan iommu pagetables itself and see what was marked dirty or > not? iommufd and VFIO container's don't co-exist, either iommufd is providing the IOMMU interface, or the current type 1 code - not both together. iommfd current approach presents the same ABI as the type1 container as compatability, and it is a possible direction to provide the iommu_domain stored dirty bits through that compat API. But, as you say, it looks unnatural and inefficient when the domain itself is storing the dirty bits inside the IOPTE. It need some study I haven't got into yet :) > My over-simplistic/naive view was that the proposal in the link > above sounded a lot simpler. While iommu-fd had more longevity for > many other usecases outside dirty tracking, no? I'd prefer we don't continue to hack on the type1 code if iommufd is expected to take over in this role - especially for a multi-vendor feature like dirty tracking. It is actually a pretty complicated topic because migration capable PCI devices are also include their own dirty tracking HW, all this needs to be harmonized somehow. VFIO proposed to squash everything into the container code, but I've been mulling about having iommufd only do system iommu and push the PCI device internal tracking over to VFIO. > I have a PoC-ish using the interface in the link, with AMD IOMMU > dirty bit supported (including Qemu emulated amd-iommu for folks > lacking the hardware). Albeit the eager-spliting + collapsing of > IOMMU hugepages is not yet done there, and I wanted to play around > the emulated intel-iommu SLADS from specs looks quite similar. Happy > to join existing effort anyways. This sounds great, I would love to bring the AMD IOMMU along with a dirty tracking implementation! Can you share some patches so we can see what the HW implementation looks like? Jason
On 2/3/22 15:18, Jason Gunthorpe wrote: > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote: >> On 2/2/22 17:03, Jason Gunthorpe wrote: >>> how to integrate that with the iommufd work, which I hope will allow >>> that series, and the other IOMMU drivers that can support this to be >>> merged.. >> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done >> there, but TBH I am not up to speed on iommu-fd yet so I missed something >> obvious for sure. When you say 'integrate that with the iommufd' can you >> expand on that? > > The general idea is that iommufd is the place to put all the iommu > driver uAPI for consumption by userspace. The IOMMU feature of dirty > tracking would belong there. > > So, some kind of API needs to be designed to meet the needs of the > IOMMU drivers. > /me nods I am gonna assume below is the most up-to-date to iommufd (as you pointed out in another thread IIRC): https://github.com/jgunthorpe/linux iommufd Let me know if it's not :) >> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd >> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's >> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the >> VMM to scan iommu pagetables itself and see what was marked dirty or >> not? > > iommufd and VFIO container's don't co-exist, either iommufd is > providing the IOMMU interface, or the current type 1 code - not both > together. > > iommfd current approach presents the same ABI as the type1 container > as compatability, and it is a possible direction to provide the > iommu_domain stored dirty bits through that compat API. > > But, as you say, it looks unnatural and inefficient when the domain > itself is storing the dirty bits inside the IOPTE. > How much is this already represented as the io-pgtable in IOMMU internal kAPI (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :( > It need some study I haven't got into yet :) > Heh :) Depending on how easy it is to obtain full extent of IO pagetables via iommu_fd and whether userspace code can scan the dirty bits on its own ... then potentially VMM/process can more efficiently scan the dirtied set? But if some layer needs to somehow mediate between the vendor IOPTE representation and an UAPI IOPTE representation, to be able to make that delegation to userspace ... then maybe both might be inefficient? I didn't see how iommu-fd would abstract the IOPTEs lookup as far as I glanced through the code, perhaps that's another ioctl(). >> My over-simplistic/naive view was that the proposal in the link >> above sounded a lot simpler. While iommu-fd had more longevity for >> many other usecases outside dirty tracking, no? > > I'd prefer we don't continue to hack on the type1 code if iommufd is > expected to take over in this role - especially for a multi-vendor > feature like dirty tracking. > But what strikes /specifically/ on the dirty bit feature is that it looks simpler with the current VFIO, the heavy lifting seems to be mostly on the IOMMU vendor. The proposed API above for VFIO looking at the container (small changes), and IOMMU vendor would do most of it: * Toggling API for start/stop dirty tracking * API to get the IOVAs dirtied * API to clear the IOVAs dirtied [this last one I am not sure it is needed as the clear could be done as we scan the ones, thus minimize TLB flush cost if these are separate] At the same time, what particularly scares me perf-wise (for the device being migrated) ... is the fact that we need to dynamically split and collapse page tables to increase the granularity of which we track. In the above interface it splits/collapses when you turn on/off the dirty tracking (respectively). That's *probably* where we need more flexibility, not sure. > It is actually a pretty complicated topic because migration capable > PCI devices are also include their own dirty tracking HW, all this > needs to be harmonized somehow. Do you have thoughts on what such device-dirty interface could look like? (Perhaps too early to poke while the FSM/UAPI is being worked out) I was wondering if container has a dirty scan/sync callback funnelled by a vendor IOMMU ops implemented (as Shameerali patches proposed), and vfio vendor driver provides one per device. Or propagate the dirty tracking API to vendor vfio driver[*]. The reporting of the dirtying, though, looks hazzy to achieve if you try to make it uniform even to userspace. Perhaps with iommu-fd you're thinking to mmap() the dirty region back to userspace, or an iommu-fd ioctl() updates the PTEs, while letting the kernel clear the dirty status via the mmap() object. And that would be the common API regardless of dirty-hw scheme. Anyway, just thinking out loud. [*] considering the device may choose where to place its tracking storage, and which scheme (bitmap, ring, etc) it might be. > VFIO proposed to squash everything > into the container code, but I've been mulling about having iommufd > only do system iommu and push the PCI device internal tracking over to > VFIO. > Seems to me that the juicy part falls mostly in IOMMU vendor code, I am not sure yet how much one can we 'offload' to a generic layer, at least compared with this other proposal. >> I have a PoC-ish using the interface in the link, with AMD IOMMU >> dirty bit supported (including Qemu emulated amd-iommu for folks >> lacking the hardware). Albeit the eager-spliting + collapsing of >> IOMMU hugepages is not yet done there, and I wanted to play around >> the emulated intel-iommu SLADS from specs looks quite similar. Happy >> to join existing effort anyways. > > This sounds great, I would love to bring the AMD IOMMU along with a > dirty tracking implementation! Can you share some patches so we can > see what the HW implementation looks like? Oh yes for sure! As I said I'm happy to help&implement along. I would really love to leverage the IOMMU feature, as that relieves the migrateable PCI device having to do the dirty tracking themselves. And well, we seem to be getting there -- spec-wise everybody has that feature *at least* documented :) Give me some time (few days only, as I gotta sort some things) and I'll respond here as follow up with link to a branch with the WIP/PoC patches. Summing up a couple of remarks on hw, hopefully they enlight: 1) On AMD the feature is advertised by their extended feature register as supported or not. On Intel, same in its equivalent (ECAP). On course different bits on different IOMMUs registers. If it is supported, the IOMMU updates (when activated) two bits per page table entry to indicate 'access' and 'dirty'. Slightly different page table format bit-wise on where access/dirty is located (between the two vendors). 2) Change protection domain flags to enable the dirty/access tracking. On AMD, it's the DTE flags (a new flag for dirty tracking). [Intel does this in the PASID table entry] 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty bit is cheap, clearing them is 'expensive' because one needs to flush IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result of an address-translation/io-page-walk. Even though the IOMMU uses interlocked operations to actually update the Access/Dirty bit in concurrency with the CPU. The AMD manuals are a tad misleading as they talk about marking non-present, but that would be catastrophic for migration as it would mean a DMA target abort for the PCI device, unless I missed something obvious. In any case, this means that the dirty bit *clearing* needs to be batched as much as possible, to amortize the cost of flushing the IOTLB. This is the same for Intel *IIUC*. 4) Adjust the granularity of pagetables in place: [This item wasn't done, but it is generic to any IOMMU because it is mostly the ability to split existing IO pages in place.] 4.a) Eager splitting and late collapsing IO hugepages -- essentially to have tracking at finer granularity. as you know generally the IOMMU map is generally done on the biggest IO page size, specially on guests. We sadly can't write-protect or anything fancier that we usually do so otherwise the PCI device gets a DMA target-abort :( So splitting, when we start dirty tracking (what I mean by eager), and do for the whole IO page table. When finished tracking, collapse the pages (which should probably be deemed optional in the common case of succeeding the migration). Guest expected to have higher IOTLB miss. This part is particularly worrying for the IO guest performance, but hopefully it doesn't turn out to be too bad. 4.b) Optionally starting dirtying earlier (at provisioning) and let userspace dynamically split pages. This is to hopefully minimize the IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly. So dirty tracking would be enabled at creation of the protection domain after the vfio container is set up, and we would use pages dirtied as a indication of what needs to be splited. Problem is for IO page sizes bigger than 1G, which might unnecessarily lead to marking too much as dirty early on; but at least it's better than transferring the whole set. Joao
On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote: > On 2/3/22 15:18, Jason Gunthorpe wrote: > > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote: > >> On 2/2/22 17:03, Jason Gunthorpe wrote: > >>> how to integrate that with the iommufd work, which I hope will allow > >>> that series, and the other IOMMU drivers that can support this to be > >>> merged.. > >> > >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done > >> there, but TBH I am not up to speed on iommu-fd yet so I missed something > >> obvious for sure. When you say 'integrate that with the iommufd' can you > >> expand on that? > > > > The general idea is that iommufd is the place to put all the iommu > > driver uAPI for consumption by userspace. The IOMMU feature of dirty > > tracking would belong there. > > > > So, some kind of API needs to be designed to meet the needs of the > > IOMMU drivers. > > > /me nods > > I am gonna assume below is the most up-to-date to iommufd (as you pointed > out in another thread IIRC): > > https://github.com/jgunthorpe/linux iommufd > > Let me know if it's not :) The iommufd part is pretty good, but there is hacky patch to hook it into vfio that isn't there, if you want to actually try it. > > But, as you say, it looks unnatural and inefficient when the domain > > itself is storing the dirty bits inside the IOPTE. > > How much is this already represented as the io-pgtable in IOMMU internal kAPI > (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today > used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :( Which are you looking at? AFACIT there is no diry page support in iommu_ops ? > then potentially VMM/process can more efficiently scan the dirtied > set? But if some layer needs to somehow mediate between the vendor > IOPTE representation and an UAPI IOPTE representation, to be able to > make that delegation to userspace ... then maybe both might be > inefficient? I didn't see how iommu-fd would abstract the IOPTEs > lookup as far as I glanced through the code, perhaps that's another > ioctl(). It is based around the same model as VFIO container - map/unmap of user address space into the IOPTEs and the user space doesn't see anything resembling a 'pte' - at least for kernel owned IO page tables. User space page tables will not be abstracted and the userspace must know the direct HW format of the IOMMU they is being used. > But what strikes /specifically/ on the dirty bit feature is that it looks > simpler with the current VFIO, the heavy lifting seems to be > mostly on the IOMMU vendor. The proposed API above for VFIO looking at > the container (small changes), and IOMMU vendor would do most of it: It is basically the same, almost certainly the user API in iommufd will be some 'get dirty bits' and 'unmap and give me the dirty bits' just like vfio has. The tricky details are around how do you manage this when the system may have multiple things invovled capable, or not, of actualy doing dirty tracking. > At the same time, what particularly scares me perf-wise (for the > device being migrated) ... is the fact that we need to dynamically > split and collapse page tables to increase the granularity of which > we track. In the above interface it splits/collapses when you turn > on/off the dirty tracking (respectively). That's *probably* where we > need more flexibility, not sure. For sure that is a particularly big adventure in the iommu driver.. > Do you have thoughts on what such device-dirty interface could look like? > (Perhaps too early to poke while the FSM/UAPI is being worked out) I've been thinking the same general read-and-clear of a dirty bitmap. It matches nicely the the KVM interface. > I was wondering if container has a dirty scan/sync callback funnelled > by a vendor IOMMU ops implemented (as Shameerali patches proposed), Yes, this is almost certainly how the in-kernel parts will look > and vfio vendor driver provides one per device. But this is less clear.. > Or propagate the dirty tracking API to vendor vfio driver[*]. > [*] considering the device may choose where to place its tracking storage, and > which scheme (bitmap, ring, etc) it might be. This has been my thinking, yes > The reporting of the dirtying, though, looks hazzy to achieve if you > try to make it uniform even to userspace. Perhaps with iommu-fd > you're thinking to mmap() the dirty region back to userspace, or an > iommu-fd ioctl() updates the PTEs, while letting the kernel clear > the dirty status via the mmap() object. And that would be the common > API regardless of dirty-hw scheme. Anyway, just thinking out loud. My general thinking has be that iommufd would control only the system IOMMU hardware. The FD interface directly exposes the iommu_domain as a manipulable object, so I'd imagine making userspace have a simple 1:1 connection to the iommu_ops of a single iommu_domain. Doing this avoids all the weirdo questions about what do you do if there is non-uniformity in the iommu_domain's. Keeping with that theme the vfio_device would provide a similar interface, on its own device FD. I don't know if mmap should be involed here, the dirty bitmaps are not so big, I suspect a simple get_user_pages_fast() would be entirely OK. > > VFIO proposed to squash everything > > into the container code, but I've been mulling about having iommufd > > only do system iommu and push the PCI device internal tracking over to > > VFIO. > > > > Seems to me that the juicy part falls mostly in IOMMU vendor code, I am > not sure yet how much one can we 'offload' to a generic layer, at least > compared with this other proposal. Yes, I expect there is very little generic code here if we go this way. The generic layer is just marshalling the ioctl(s) to the iommu drivers. Certainly not providing storage or anything/ > Give me some time (few days only, as I gotta sort some things) and I'll > respond here as follow up with link to a branch with the WIP/PoC patches. Great! > 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty > bit is cheap, clearing them is 'expensive' because one needs to flush > IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result > of an address-translation/io-page-walk. Even though the IOMMU uses interlocked > operations to actually update the Access/Dirty bit in concurrency with > the CPU. The AMD manuals are a tad misleading as they talk about marking > non-present, but that would be catastrophic for migration as it would > mean a DMA target abort for the PCI device, unless I missed something obvious. > In any case, this means that the dirty bit *clearing* needs to be > batched as much as possible, to amortize the cost of flushing the IOTLB. > This is the same for Intel *IIUC*. You have to mark it as non-present to do the final read out if something unmaps while the tracker is on - eg emulating a viommu or something. Then you mark non-present, flush the iotlb and read back the dirty bit. Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and then read and clear them. > 4) Adjust the granularity of pagetables in place: > [This item wasn't done, but it is generic to any IOMMU because it > is mostly the ability to split existing IO pages in place.] This seems like it would be some interesting amount of driver work, but yes it could be a generic new iommu_domina op. > 4.b) Optionally starting dirtying earlier (at provisioning) and let > userspace dynamically split pages. This is to hopefully minimize the > IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly. > So dirty tracking would be enabled at creation of the protection domain > after the vfio container is set up, and we would use pages dirtied > as a indication of what needs to be splited. Problem is for IO page > sizes bigger than 1G, which might unnecessarily lead to marking too > much as dirty early on; but at least it's better than transferring the > whole set. I'm not sure running with dirty tracking permanently on would be good for guest performance either. I'd suspect you'd be better to have a warm up period where you track dirtys and split down pages. It is interesting, this is a possible reason why device dirty tracking might actually perfom better because it can operate at a different granularity from the system iommu without disrupting the guest DMA performance. Jason
On 2/4/22 23:07, Jason Gunthorpe wrote: > On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote: >> On 2/3/22 15:18, Jason Gunthorpe wrote: >>> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote: >>>> On 2/2/22 17:03, Jason Gunthorpe wrote: >>>>> how to integrate that with the iommufd work, which I hope will allow >>>>> that series, and the other IOMMU drivers that can support this to be >>>>> merged.. >>>> >>>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done >>>> there, but TBH I am not up to speed on iommu-fd yet so I missed something >>>> obvious for sure. When you say 'integrate that with the iommufd' can you >>>> expand on that? >>> >>> The general idea is that iommufd is the place to put all the iommu >>> driver uAPI for consumption by userspace. The IOMMU feature of dirty >>> tracking would belong there. >>> >>> So, some kind of API needs to be designed to meet the needs of the >>> IOMMU drivers. >>> >> /me nods >> >> I am gonna assume below is the most up-to-date to iommufd (as you pointed >> out in another thread IIRC): >> >> https://github.com/jgunthorpe/linux iommufd >> >> Let me know if it's not :) > > The iommufd part is pretty good, but there is hacky patch to hook it > into vfio that isn't there, if you want to actually try it. > OK -- Thanks for the heads up. >>> But, as you say, it looks unnatural and inefficient when the domain >>> itself is storing the dirty bits inside the IOPTE. >> >> How much is this already represented as the io-pgtable in IOMMU internal kAPI >> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today >> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :( > > Which are you looking at? AFACIT there is no diry page support in > iommu_ops ? > There is not, indeed. But how you manage IO page tables felt *somewhat hinting* to io-pgtable -- but maybe it's a stretch. The dirty page support builds on top of that, as ARM IOMMUs use iommu io-pgtable. In the series in the link, it pretty much boils down to essentially 3 iommu_ops introduced: * support_dirty_log(domain) - Checks whether a given IOMMU domain supports dirty tracking * switch_dirty_log(domain, enable, iova, size, prot) - Enables/Disables IOMMU dirty tracking on an IOVA range (The range I suppose it's because ARM supports range tracking, contrary to Intel and AMD which affect all the protection domain pg tables) * sync_dirty_log(domain, iova, size, &bitmap, ...) The IO pgtables, have an interesting set of APIs strictly for page table manipulation, and they are sort of optional (so far ARM and AMD). Those include, map, unmap, iova_to_pfn translation. So, the dirty log supports for ARM builds on that and adds split, merge (for splitting and collapsing IO pagtables), and finally sync_dirty_log() which is where we get the bitmap of dirtied pages *updated*. But well, at the end of the day for an IOMMU driver the domain ops are the important stuff, maybe IO pgtable framework isn't as critical (Intel for example, doesn't use that at all). >> then potentially VMM/process can more efficiently scan the dirtied >> set? But if some layer needs to somehow mediate between the vendor >> IOPTE representation and an UAPI IOPTE representation, to be able to >> make that delegation to userspace ... then maybe both might be >> inefficient? I didn't see how iommu-fd would abstract the IOPTEs >> lookup as far as I glanced through the code, perhaps that's another >> ioctl(). > > It is based around the same model as VFIO container - map/unmap of > user address space into the IOPTEs and the user space doesn't see > anything resembling a 'pte' - at least for kernel owned IO page > tables. > Ah! I misinterpreted what you were saying. > User space page tables will not be abstracted and the userspace must > know the direct HW format of the IOMMU they is being used. > That's countering earlier sentence? Because hw format (for me at least) means PTE and protection domain config format too. And if iommufd abstracts the HW format, modelling after the IOMMU domain and its ops, then it's abstracting userspace from those details e.g. it works over IOVAs, but not over its vendor representation of how that IOVA is set up. I am probably being dense. >> But what strikes /specifically/ on the dirty bit feature is that it looks >> simpler with the current VFIO, the heavy lifting seems to be >> mostly on the IOMMU vendor. The proposed API above for VFIO looking at >> the container (small changes), and IOMMU vendor would do most of it: > > It is basically the same, almost certainly the user API in iommufd > will be some 'get dirty bits' and 'unmap and give me the dirty bits' > just like vfio has. > The 'unmap and give dirty bits' looks to be something TBD even in a VFIO migration flow. Considering that we pause vCPUs, and we quiesce the VFs, I guess we are mostly supposed to have a guarantee that no DMA is supposed to be happening (excluding P2P)? So perhaps the unmap is unneeded after quiescing the VF. Which probably is important migration-wise considering the unmap is really what's expensive (e.g. needs IOTLB flush) and that would add up between stop copy and the time it would resume on the destination, if we need to wait to unmap the whole IOVA from source. > The tricky details are around how do you manage this when the system > may have multiple things invovled capable, or not, of actualy doing > dirty tracking. > Yeap. >> At the same time, what particularly scares me perf-wise (for the >> device being migrated) ... is the fact that we need to dynamically >> split and collapse page tables to increase the granularity of which >> we track. In the above interface it splits/collapses when you turn >> on/off the dirty tracking (respectively). That's *probably* where we >> need more flexibility, not sure. > > For sure that is a particularly big adventure in the iommu driver.. > Yeah. Me playing around was with VFIO IOMMU hugepages disabled, and this was something I am planning on working when I get back to this. >> Do you have thoughts on what such device-dirty interface could look like? >> (Perhaps too early to poke while the FSM/UAPI is being worked out) > > I've been thinking the same general read-and-clear of a dirty > bitmap. It matches nicely the the KVM interface. > A bitmap I think it is a good start. We have a bitmap based interface in KVM, but there's also a recent ring interface for dirty tracking, which has some probably more determinism than a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables and breaking its entries on-demand IIRC, whereas Intel resembles something closer to a 512 entries 'ring' with VMX PML, which tells what has been dirtied. >> I was wondering if container has a dirty scan/sync callback funnelled >> by a vendor IOMMU ops implemented (as Shameerali patches proposed), > > Yes, this is almost certainly how the in-kernel parts will look > /me nods >> and vfio vendor driver provides one per device. > > But this is less clear.. > >> Or propagate the dirty tracking API to vendor vfio driver[*]. >> [*] considering the device may choose where to place its tracking storage, and >> which scheme (bitmap, ring, etc) it might be. > > This has been my thinking, yes > /me nods >> The reporting of the dirtying, though, looks hazzy to achieve if you >> try to make it uniform even to userspace. Perhaps with iommu-fd >> you're thinking to mmap() the dirty region back to userspace, or an >> iommu-fd ioctl() updates the PTEs, while letting the kernel clear >> the dirty status via the mmap() object. And that would be the common >> API regardless of dirty-hw scheme. Anyway, just thinking out loud. > > My general thinking has be that iommufd would control only the system > IOMMU hardware. The FD interface directly exposes the iommu_domain as > a manipulable object, so I'd imagine making userspace have a simple > 1:1 connection to the iommu_ops of a single iommu_domain. > > Doing this avoids all the weirdo questions about what do you do if > there is non-uniformity in the iommu_domain's. > > Keeping with that theme the vfio_device would provide a similar > interface, on its own device FD. > > I don't know if mmap should be involed here, the dirty bitmaps are not > so big, I suspect a simple get_user_pages_fast() would be entirely OK. > Considering that is 32MB of a bitmap per TB maybe it is cheap. >>> VFIO proposed to squash everything >>> into the container code, but I've been mulling about having iommufd >>> only do system iommu and push the PCI device internal tracking over to >>> VFIO. >>> >> >> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am >> not sure yet how much one can we 'offload' to a generic layer, at least >> compared with this other proposal. > > Yes, I expect there is very little generic code here if we go this > way. The generic layer is just marshalling the ioctl(s) to the iommu > drivers. Certainly not providing storage or anything/ > Gotcha. Perhaps I need to sort out how this would work with iommufd. But I guess we can hash out how the iommu ops would look like? We seem to be on the same page, at least that's the feeling I get. >> Give me some time (few days only, as I gotta sort some things) and I'll >> respond here as follow up with link to a branch with the WIP/PoC patches. > > Great! > Here it is. "A few days" turn into a week sorry :/ https://github.com/jpemartins/qemu amd-iommu-hdsup-wip https://github.com/jpemartins/linux amd-vfio-iommu-hdsup-wip Note, it is an early PoC. I still need to get the split/collapse thing going, and fix the FIXMEs there, and have a second good look at the iommu page tables. The Qemu link above has one patch to get device-dirty bitmaps via HMP to be able to measure the rate by hw dirties and another patch for emulated amd iommu support too, should you lack the hardware to play (I usually do on both, more for people to be able to coverage test it). I sadly cannot do the end-to-end migration test. >> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty >> bit is cheap, clearing them is 'expensive' because one needs to flush >> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result >> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked >> operations to actually update the Access/Dirty bit in concurrency with >> the CPU. The AMD manuals are a tad misleading as they talk about marking >> non-present, but that would be catastrophic for migration as it would >> mean a DMA target abort for the PCI device, unless I missed something obvious. >> In any case, this means that the dirty bit *clearing* needs to be >> batched as much as possible, to amortize the cost of flushing the IOTLB. >> This is the same for Intel *IIUC*. > > You have to mark it as non-present to do the final read out if > something unmaps while the tracker is on - eg emulating a viommu or > something. Then you mark non-present, flush the iotlb and read back > the dirty bit. > You would be surprised that AMD IOMMUs have also an accelerated vIOMMU too :) without needing VMM intervention (that's also not supported in Linux). But the mark as NP for viommu is something I haven't investigated. > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and > then read and clear them. > It's the other way around AIUI. The dirty bits are sticky, so you flush the IOTLB after clearing as means to notify the IOMMU to set the dirty bits again on the next memory transaction (or ATS translation). I am not entirely sure we need to unmap + mark non-present for non-viommu That would actually mean something is not properly quiscieing the VF DMA. Maybe we should .. to gate whether if we should actually continue with LM if something kept doing DMA when it shouldn't have. Also, considering the dirty-bitmap check happens on very-very long IOVA extents -- some measurements over how long the scans need to be done. Mostly validated how much a much NIC was dirtying with traditional networking tools. >> 4) Adjust the granularity of pagetables in place: >> [This item wasn't done, but it is generic to any IOMMU because it >> is mostly the ability to split existing IO pages in place.] > > This seems like it would be some interesting amount of driver work, > but yes it could be a generic new iommu_domina op. > I am slightly at odds that .split and .collapse at .switch() are enough. But, with iommu if we are working on top of an IOMMU domain object and .split and .collapse are iommu_ops perhaps that looks to be enough flexibility to give userspace the ability to decide what it wants to split, if it starts eargerly/warming-up tracking dirty pages. The split and collapsing is something I wanted to work on next, to get to a stage closer to that of an RFC on the AMD side. >> 4.b) Optionally starting dirtying earlier (at provisioning) and let >> userspace dynamically split pages. This is to hopefully minimize the >> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly. >> So dirty tracking would be enabled at creation of the protection domain >> after the vfio container is set up, and we would use pages dirtied >> as a indication of what needs to be splited. Problem is for IO page >> sizes bigger than 1G, which might unnecessarily lead to marking too >> much as dirty early on; but at least it's better than transferring the >> whole set. > > I'm not sure running with dirty tracking permanently on would be good > for guest performance either. > Hmmm, judging how the IOMMU works I am not sure this is particularly affecting DMA performance (not sure yet about RDMA, it's something I curious to see how it gets to perform with 4K IOPTEs, and with dirty tracking always enabled). Considering how the bits are sticky, and unless CPU clears it, it's short of a nop? Unless of course the checking for A^D during an atomic memory transaction is expensive. Needs some performance testing nonetheless. The IOTLB flushes pre/after clearing, tough, would visibly hurt DMA performance :( I forgot to mention, but the early enablement of IOMMU dirty tracking was also meant to fully know since guest creation what needs to be sent to the destination. Otherwise, wouldn't we need to send the whole pinned set to destination, if we only start tracking dirty pages during migration? > I'd suspect you'd be better to have a warm up period where you track > dirtys and split down pages. > Yeah, but right now the splitting is done when switching on and off dirty tracking, for the entirety of page tables. And well at least on AMD, the flag is attached to a protection domain. For SMMUv3. it appears that it's per IOVA range if I read the original code correctly (which is interesting contrast to AMD/Intel) Also, this is probably a differentiator for iommufd, if we were to provide split and collapse semantics to IOMMU domain objects that userspace can use. That would get more freedom, to switch dirty-tracking, and then do the warm up thingie and piggy back on what it wants to split before migration. perhaps the switch() should get some flag to pick where to split, I guess. > It is interesting, this is a possible reason why device dirty tracking > might actually perfom better because it can operate at a different > granularity from the system iommu without disrupting the guest DMA > performance. I agree -- and the other advantage is that you don't depend on platform support for dirty tracking, which is largely nonexistent. Albeit I personally like to have the IOMMU support because it also frees endpoints to do rather expensive dirty tracking, which is a rather even more exclusive feature to have. It's trading off max performance for LM commodity over, where DMA performance can be less (momentarily). Joao
On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote: > But well, at the end of the day for an IOMMU driver the domain ops are > the important stuff, maybe IO pgtable framework isn't as critical > (Intel for example, doesn't use that at all). Right, it doesn't matter what library was used to implement the domain.. > > User space page tables will not be abstracted and the userspace must > > know the direct HW format of the IOMMU they is being used. > > > That's countering earlier sentence? Because hw format (for me at least) > means PTE and protection domain config format too. And if iommufd > abstracts the HW format, modelling after the IOMMU domain and its ops, > then it's abstracting userspace from those details e.g. it works over > IOVAs, but not over its vendor representation of how that IOVA is set up. > > I am probably being dense. It is both ways, one kind of domain provides a kernel supplied map/unmap that implements the IO PTE manipulation in kernel memory But if you want the IOPTE to be in user memory then user must read/write it and it cannot use that - so a user domain will not have map/unmap. > > It is basically the same, almost certainly the user API in iommufd > > will be some 'get dirty bits' and 'unmap and give me the dirty bits' > > just like vfio has. > > > > The 'unmap and give dirty bits' looks to be something TBD even in a VFIO > migration flow. It is essential to implement any kind of viommu behavior where map/unmap is occuring while the dirty tracking is running. It should never make a difference except in some ugly edge cases where the dma and unmap are racing. > supposed to be happening (excluding P2P)? So perhaps the unmap is > unneeded after quiescing the VF. Yes, you don't need to unmap for migration, a simple fully synchronous read and clear is sufficient. But that final read, while DMA is quite, must obtain the latest dirty bit data. > We have a bitmap based interface in KVM, but there's also a recent ring > interface for dirty tracking, which has some probably more determinism than > a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables > and breaking its entries on-demand IIRC, whereas Intel resembles something > closer to a 512 entries 'ring' with VMX PML, which tells what has been > dirtied. KVM has an advantage that it can throttle the rate of dirty generation so it can rely on logging. The IOMMU can't throttle DMA, so we are stuck with a bitmap > > I don't know if mmap should be involed here, the dirty bitmaps are not > > so big, I suspect a simple get_user_pages_fast() would be entirely OK. > > > Considering that is 32MB of a bitmap per TB maybe it is cheap. Rigt. GUP fasting a couple huge pages is nothing compared to scanning 1TB of IO page table. > >> Give me some time (few days only, as I gotta sort some things) and I'll > >> respond here as follow up with link to a branch with the WIP/PoC patches. > > > > Great! > > > Here it is. "A few days" turn into a week sorry :/ > > https://github.com/jpemartins/qemu amd-iommu-hdsup-wip > https://github.com/jpemartins/linux amd-vfio-iommu-hdsup-wip > > Note, it is an early PoC. I still need to get the split/collapse thing going, > and fix the FIXMEs there, and have a second good look at the iommu page tables. Oh I'll try to look a this thanks > > You have to mark it as non-present to do the final read out if > > something unmaps while the tracker is on - eg emulating a viommu or > > something. Then you mark non-present, flush the iotlb and read back > > the dirty bit. > > > You would be surprised that AMD IOMMUs have also an accelerated vIOMMU > too :) without needing VMM intervention (that's also not supported > in Linux). I'm sure, but dirty tracking has to happen on the kernel owned page table, not the user owned one I think.. > > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and > > then read and clear them. > > > It's the other way around AIUI. The dirty bits are sticky, so you flush > the IOTLB after clearing as means to notify the IOMMU to set the dirty bits > again on the next memory transaction (or ATS translation). I guess it depends on how the HW works, if it writes the dirty bit back to ram instantly on the first dirty, or if it only writes the dirty bit when flushing the iotlb. In any case you have to synchronize with the HW in some way to ensure that all dirty bits are 'pulled back' into system memory to resolve races (ie you need the iommu HW to release and the CPU to acquire on the dirty data) before trying to read them, at least for the final quieced system read. > I am not entirely sure we need to unmap + mark non-present for non-viommu > That would actually mean something is not properly quiscieing the VF DMA. > Maybe we should .. to gate whether if we should actually continue with LM > if something kept doing DMA when it shouldn't have. It is certainly an edge case. A device would be misbehaving to continue DMA. > > This seems like it would be some interesting amount of driver work, > > but yes it could be a generic new iommu_domina op. > > I am slightly at odds that .split and .collapse at .switch() are enough. > But, with iommu if we are working on top of an IOMMU domain object and > .split and .collapse are iommu_ops perhaps that looks to be enough > flexibility to give userspace the ability to decide what it wants to > split, if it starts eargerly/warming-up tracking dirty pages. > > The split and collapsing is something I wanted to work on next, to get > to a stage closer to that of an RFC on the AMD side. split/collapse seems kind of orthogonal to me it doesn't really connect to dirty tracking other than being mostly useful during dirty tracking. And I wonder how hard split is when trying to atomically preserve any dirty bit.. > Hmmm, judging how the IOMMU works I am not sure this is particularly > affecting DMA performance (not sure yet about RDMA, it's something I > curious to see how it gets to perform with 4K IOPTEs, and with dirty > tracking always enabled). Considering how the bits are sticky, and > unless CPU clears it, it's short of a nop? Unless of course the checking > for A^D during an atomic memory transaction is expensive. Needs some > performance testing nonetheless. If you leave the bits all dirty then why do it? The point is to see the dirties, which means the iommu is generating a workload of dirty cachelines while operating it didn't do before. > I forgot to mention, but the early enablement of IOMMU dirty tracking > was also meant to fully know since guest creation what needs to be > sent to the destination. Otherwise, wouldn't we need to send the whole > pinned set to destination, if we only start tracking dirty pages during > migration? ? At the start of migration you have to send everything. Dirty tracking is intended to allow the VM to be stopped and then have only a small set of data to xfer. > Also, this is probably a differentiator for iommufd, if we were to provide > split and collapse semantics to IOMMU domain objects that userspace can use. > That would get more freedom, to switch dirty-tracking, and then do the warm > up thingie and piggy back on what it wants to split before migration. > perhaps the switch() should get some flag to pick where to split, I guess. Yes, right. Split/collapse should be completely seperate from dirty tracking. Jason
On 2/11/22 17:49, Jason Gunthorpe wrote: > On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote: >>> It is basically the same, almost certainly the user API in iommufd >>> will be some 'get dirty bits' and 'unmap and give me the dirty bits' >>> just like vfio has. >>> >> >> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO >> migration flow. > > It is essential to implement any kind of viommu behavior where > map/unmap is occuring while the dirty tracking is running. It should > never make a difference except in some ugly edge cases where the dma > and unmap are racing. > /me nods >> supposed to be happening (excluding P2P)? So perhaps the unmap is >> unneeded after quiescing the VF. > > Yes, you don't need to unmap for migration, a simple fully synchronous > read and clear is sufficient. But that final read, while DMA is quite, > must obtain the latest dirty bit data. > The final read doesn't need special logic in VFIO/IOMMU IUC (maybe only in the VMM) Anyways, the above paragraph matches my understanding. >> We have a bitmap based interface in KVM, but there's also a recent ring >> interface for dirty tracking, which has some probably more determinism than >> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables >> and breaking its entries on-demand IIRC, whereas Intel resembles something >> closer to a 512 entries 'ring' with VMX PML, which tells what has been >> dirtied. > > KVM has an advantage that it can throttle the rate of dirty generation > so it can rely on logging. The IOMMU can't throttle DMA, so we are > stuck with a bitmap > Yeap, sadly :( >>> I don't know if mmap should be involed here, the dirty bitmaps are not >>> so big, I suspect a simple get_user_pages_fast() would be entirely OK. >>> >> Considering that is 32MB of a bitmap per TB maybe it is cheap. > > Rigt. GUP fasting a couple huge pages is nothing compared to scanning > 1TB of IO page table. > ... With 4K PTEs which is even more ludricous expensive. Well yeah, a lesser concern the mangling of the bitmap, when you put that way heh >>> You have to mark it as non-present to do the final read out if >>> something unmaps while the tracker is on - eg emulating a viommu or >>> something. Then you mark non-present, flush the iotlb and read back >>> the dirty bit. >>> >> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU >> too :) without needing VMM intervention (that's also not supported >> in Linux). > > I'm sure, but dirty tracking has to happen on the kernel owned page > table, not the user owned one I think.. > The plumbing for the hw-accelerated vIOMMU is a little different that a regular vIOMMU, at least IIUC host does not take an active part in the GVA -> GPA translation. Suravee's preso explains it nicely, if you don't have time to fiddle with the SDM: https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf >>> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and >>> then read and clear them. >>> >> It's the other way around AIUI. The dirty bits are sticky, so you flush >> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits >> again on the next memory transaction (or ATS translation). > > I guess it depends on how the HW works, if it writes the dirty bit > back to ram instantly on the first dirty, or if it only writes the > dirty bit when flushing the iotlb. > The manual says roughly that the update is visible to CPU as soon as the it updates. Particularly from the IOMMU SDM: "Note that the setting of accessed and dirty status bits in the page tables is visible to both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked operations to update A and D bits must be 64-bit operations and naturally aligned on a 64-bit boundary" And ... 1. Decodes the read and write intent from the memory access. 2. If P=0 in the page descriptor, fail the access. 3. Compare the A & D bits in the descriptor with the read and write intent in the request. 4. If the A or D bits need to be updated in the descriptor: * Start atomic operation. * Read the descriptor as a 64-bit access. * If the descriptor no longer appears to require an update, release the atomic lock with no further action and continue to step 5. * Calculate the new A & D bits. * Write the descriptor as a 64-bit access. * End atomic operation. 5. Continue to the next stage of translation or to the memory access. > In any case you have to synchronize with the HW in some way to ensure > that all dirty bits are 'pulled back' into system memory to resolve > races (ie you need the iommu HW to release and the CPU to acquire on > the dirty data) before trying to read them, at least for the final > quieced system read. > /me nods >>> This seems like it would be some interesting amount of driver work, >>> but yes it could be a generic new iommu_domina op. >> >> I am slightly at odds that .split and .collapse at .switch() are enough. >> But, with iommu if we are working on top of an IOMMU domain object and >> .split and .collapse are iommu_ops perhaps that looks to be enough >> flexibility to give userspace the ability to decide what it wants to >> split, if it starts eargerly/warming-up tracking dirty pages. >> >> The split and collapsing is something I wanted to work on next, to get >> to a stage closer to that of an RFC on the AMD side. > > split/collapse seems kind of orthogonal to me it doesn't really > connect to dirty tracking other than being mostly useful during dirty > tracking. > > And I wonder how hard split is when trying to atomically preserve any > dirty bit.. > Would would it be hard? The D bit is supposed to be replicated when you split to smaller page size. >> Hmmm, judging how the IOMMU works I am not sure this is particularly >> affecting DMA performance (not sure yet about RDMA, it's something I >> curious to see how it gets to perform with 4K IOPTEs, and with dirty >> tracking always enabled). Considering how the bits are sticky, and >> unless CPU clears it, it's short of a nop? Unless of course the checking >> for A^D during an atomic memory transaction is expensive. Needs some >> performance testing nonetheless. > > If you leave the bits all dirty then why do it? The point is to see > the dirties, which means the iommu is generating a workload of dirty > cachelines while operating it didn't do before. > My thinking was that if it's dirtied and in the IOTLB most likely the descriptor in the IOTLB is cached. And if you need to do a IOMMU page walk to resolve an IOVA, perhaps the check for the A & D bits needing to be updated is probably the least problem in this path. Naturally, if it's not split, you have a much higher chance (e.g. with 1GB entries) to stay in the IOTLB and just compare two bits *prior* to consider starting the atomic operation to update the descriptor. >> I forgot to mention, but the early enablement of IOMMU dirty tracking >> was also meant to fully know since guest creation what needs to be >> sent to the destination. Otherwise, wouldn't we need to send the whole >> pinned set to destination, if we only start tracking dirty pages during >> migration? > > ? At the start of migration you have to send everything. Dirty > tracking is intended to allow the VM to be stopped and then have only > a small set of data to xfer. > Right, that's how it works today. This is just preemptive longterm thinking about the overal problem space (probably unnecessary noise at this stage). Particularly whenever I need to migrate 1 to 2TB VMs. Particular that the stage *prior* to precopy takes way too long to transfer the whole memory. So I was thinking say only transfer the pages that are populated[*] in the second-stage page tables (for the CPU) coupled with IOMMU tracking from the beginning (prior to vcpus even entering). That could probably decrease 1024 1GB Dirtied IOVA entries, to maybe only dirty a smaller subset, saving a whole bootload of time. [*] VMs without VFIO would be even easier as the first stage page tables are non-present. >> Also, this is probably a differentiator for iommufd, if we were to provide >> split and collapse semantics to IOMMU domain objects that userspace can use. >> That would get more freedom, to switch dirty-tracking, and then do the warm >> up thingie and piggy back on what it wants to split before migration. >> perhaps the switch() should get some flag to pick where to split, I guess. > > Yes, right. Split/collapse should be completely seperate from dirty > tracking. Yeap. I wonder if we could start progressing the dirty tracking as a first initial series and then have the split + collapse handling as a second part? That would be quite nice to get me going! :D Joao
On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote: > The plumbing for the hw-accelerated vIOMMU is a little different that > a regular vIOMMU, at least IIUC host does not take an active part in the > GVA -> GPA translation. Suravee's preso explains it nicely, if you don't > have time to fiddle with the SDM: > > https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf That looks like the same nesting everyone else is talking about. I've been refering to this as 'user space page tables' to avoid a lot of ambiguity becuase what it fundamentally does is allow userspace to directly manage an IO page table - building the IO PTEs, issue invalidations, etc. To be secure the userspace page table is nested under a kernel owned page table and all the IO PTE offsets/etc are understood to be within the address space of the kernel owned page table. When combined with KVM the userspace is just inside a VM. > 1. Decodes the read and write intent from the memory access. > 2. If P=0 in the page descriptor, fail the access. > 3. Compare the A & D bits in the descriptor with the read and write intent in the request. > 4. If the A or D bits need to be updated in the descriptor: Ah, so the dirty update is actually atomic on the first write before any DMA happens - and I suppose all of this happens when the entry is first loaded into the IOTLB. So the flush is to allow the IOTLB to see the cleared D bit.. > > split/collapse seems kind of orthogonal to me it doesn't really > > connect to dirty tracking other than being mostly useful during dirty > > tracking. > > > > And I wonder how hard split is when trying to atomically preserve any > > dirty bit.. > > > Would would it be hard? The D bit is supposed to be replicated when you > split to smaller page size. I guess it depends on how the 'acquire' is done, as the CPU has to atomically replace a large entry with a pointer to a small entry, flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set in the old entry then it has to sprinkle it into the new entries with atomics. > This is just preemptive longterm thinking about the overal problem > space (probably unnecessary noise at this stage). Particularly > whenever I need to migrate 1 to 2TB VMs. Particular that the stage > *prior* to precopy takes way too long to transfer the whole > memory. So I was thinking say only transfer the pages that are > populated[*] in the second-stage page tables (for the CPU) coupled > with IOMMU tracking from the beginning (prior to vcpus even > entering). That could probably decrease 1024 1GB Dirtied IOVA > entries, to maybe only dirty a smaller subset, saving a whole > bootload of time. Oh, you want to effectively optimize zero page detection.. > I wonder if we could start progressing the dirty tracking as a first initial series and > then have the split + collapse handling as a second part? That would be quite > nice to get me going! :D I think so, and I think we should. It is such a big problem space, it needs to get broken up. Jason
On 2/12/22 00:01, Jason Gunthorpe wrote: > On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote: >> 1. Decodes the read and write intent from the memory access. >> 2. If P=0 in the page descriptor, fail the access. >> 3. Compare the A & D bits in the descriptor with the read and write intent in the request. >> 4. If the A or D bits need to be updated in the descriptor: > > Ah, so the dirty update is actually atomic on the first write before > any DMA happens - and I suppose all of this happens when the entry is > first loaded into the IOTLB. > Intel's equivalent feature suggests me that this works the same way there. ARM update of ioptes looks to be slightly different[*] but this FEAT_BBM in SMMUv3.2 makes it work in similar way to Intel/AMD. But I could be misunderstanding something there. [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB and then write the new valid entry. Not sure I understood correctly that this is the 'break-before-make' thingie. > So the flush is to allow the IOTLB to see the cleared D bit.. > Right. >>> split/collapse seems kind of orthogonal to me it doesn't really >>> connect to dirty tracking other than being mostly useful during dirty >>> tracking. >>> >>> And I wonder how hard split is when trying to atomically preserve any >>> dirty bit.. >>> >> Would would it be hard? The D bit is supposed to be replicated when you >> split to smaller page size. > > I guess it depends on how the 'acquire' is done, as the CPU has to > atomically replace a large entry with a pointer to a small entry, > flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set > in the old entry then it has to sprinkle it into the new entries with > atomics. > ISTR some mention to what we are chatting here in the IOMMU SDM: When a non-default page size is used , software must OR the Dirty bits in all of the replicated host PTEs used to map the page. The IOMMU does not guarantee the Dirty bits are set in all of the replicated PTEs. Any portion of the page may have been written even if the Dirty bit is set in only one of the replicated PTEs. >> I wonder if we could start progressing the dirty tracking as a first initial series and >> then have the split + collapse handling as a second part? That would be quite >> nice to get me going! :D > > I think so, and I think we should. It is such a big problem space, it > needs to get broken up. OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the past except with the x86 support only[*]. And we poke holes there I guess. [*] I might include Intel too, albeit emulated only.
On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote: > [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB > and then write the new valid entry. Not sure I understood correctly that this > is the 'break-before-make' thingie. Doesn't that explode if the invalid entry is DMA'd to? > When a non-default page size is used , software must OR the Dirty bits in > all of the replicated host PTEs used to map the page. The IOMMU does not > guarantee the Dirty bits are set in all of the replicated PTEs. Any portion > of the page may have been written even if the Dirty bit is set in only one > of the replicated PTEs. I suspect that is talking about something else > >> I wonder if we could start progressing the dirty tracking as a first initial series and > >> then have the split + collapse handling as a second part? That would be quite > >> nice to get me going! :D > > > > I think so, and I think we should. It is such a big problem space, it > > needs to get broken up. > > OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the > past except with the x86 support only[*]. And we poke holes there I guess. > > [*] I might include Intel too, albeit emulated only. Like I said, I'd prefer we not build more on the VFIO type 1 code until we have a conclusion for iommufd.. While returning the dirty data looks straight forward, it is hard to see an obvious path to enabling and controlling the system iommu the way vfio is now. Jason
On 2/14/22 14:06, Jason Gunthorpe wrote: > On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote: > >> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB >> and then write the new valid entry. Not sure I understood correctly that this >> is the 'break-before-make' thingie. > > Doesn't that explode if the invalid entry is DMA'd to? > Yes, IIUC. Also, the manual has this note: "Note: For example, to split a block into constituent granules (or to merge a span of granules into an equivalent block), VMSA requires the region to be made invalid, a TLB invalidate performed, then to make the region take the new configuration. Note: The requirement for a break-before-make sequence can cause problems for unrelated I/O streams that might use addresses overlapping a region of interest, because the I/O streams cannot always be conveniently stopped and might not tolerate translation faults. It is advantageous to perform live update of a block into smaller translations, or a set of translations into a larger block size." Probably why the original SMMUv3.2 dirty track series requires FEAT_BBM as it had to do in-place atomic updates to split/collapse IO pgtables. Not enterily clear if HTTU Dirty access requires the same. >>>> I wonder if we could start progressing the dirty tracking as a first initial series and >>>> then have the split + collapse handling as a second part? That would be quite >>>> nice to get me going! :D >>> >>> I think so, and I think we should. It is such a big problem space, it >>> needs to get broken up. >> >> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the >> past except with the x86 support only[*]. And we poke holes there I guess. >> >> [*] I might include Intel too, albeit emulated only. > > Like I said, I'd prefer we not build more on the VFIO type 1 code > until we have a conclusion for iommufd.. > I didn't quite understand what you mean by conclusion. If by conclusion you mean the whole thing to be merged, how can the work be broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant in terms of direction... I can build on top of iommufd -- Just trying to understand how this is going to work out. > While returning the dirty data looks straight forward, it is hard to > see an obvious path to enabling and controlling the system iommu the > way vfio is now. It seems strange to have a whole UAPI for userspace [*] meant to return dirty data to userspace, when dirty right now means the whole pinned page set and so copying the whole guest ... and the guest is running so we might be racing with the device changing guest pages with the VMM/CPU unaware of it. Even with no dwelling of IOMMU pagetables (i.e. split/collapse IO base pages) it would still help greatly the current status quo of copying the entire thing :( Hence my thinking was that the patches /if small/ would let us see how dirty tracking might work for iommu kAPI (and iommufd) too. Would it be better to do more iterative steps (when possible) as opposed to scratch and rebuild VFIO type1 IOMMU handling? Joao [*] VFIO_IOMMU_DIRTY_PAGES{_FLAG_START,_FLAG_STOP,_FLAG_GET_BITMAP}
On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote: > On 2/14/22 14:06, Jason Gunthorpe wrote: > > On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote: > > > >> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB > >> and then write the new valid entry. Not sure I understood correctly that this > >> is the 'break-before-make' thingie. > > > > Doesn't that explode if the invalid entry is DMA'd to? > > > Yes, IIUC. Also, the manual has this note: Heh, sounds like "this doesn't work" to me :) > > Like I said, I'd prefer we not build more on the VFIO type 1 code > > until we have a conclusion for iommufd.. > > > > I didn't quite understand what you mean by conclusion. If people are dead-set against doing iommufd, then lets abandon the idea and go back to hacking up vfio. > If by conclusion you mean the whole thing to be merged, how can the work be > broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant > in terms of direction... I think go ahead and build it on top of iommufd, start working out the API details, etc. I think once the direction is concluded the new APIs will go forward. > > While returning the dirty data looks straight forward, it is hard to > > see an obvious path to enabling and controlling the system iommu the > > way vfio is now. > > It seems strange to have a whole UAPI for userspace [*] meant to > return dirty data to userspace, when dirty right now means the whole > pinned page set and so copying the whole guest ... Yes, the whole thing is only partially implemented, and doesn't have any in-kernel user. It is another place holder for an implementation to come someday. > Hence my thinking was that the patches /if small/ would let us see how dirty > tracking might work for iommu kAPI (and iommufd) too. It could be tried, but I think if you go into there you will find it quickly turns quite complicated to address all the edge cases. Eg what do you do if you have a mdev present after you turn on system tracking? What if the mdev is using a PASID? What about hotplug of new VFIO devices? Remember, dirty tracking for vfio is totally useless without also having vfio device migration. Do you already have a migration capable device to use with this? > Would it be better to do more iterative steps (when possible) as opposed to > scratch and rebuild VFIO type1 IOMMU handling? Possibly, but every thing that gets added has to be carried over to the new code too, and energy has to be expended trying to figure out how the half implemented stuff should work while finishing it. At the very least we must decide what to do with device-provided dirty tracking before the VFIO type1 stuff can be altered to use the system IOMMU. This is very much like the migration FSM, the only appeal is the existing qemu implementation of the protocol. Jason
On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote: > Hi, > > This series attempts to add vfio live migration support for > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space > includes both the functional register space and migration > control register space. As discussed in RFCv1[0], this may create > security issues as these regions get shared between the Guest > driver and the migration driver. Based on the feedback, we tried > to address those concerns in this version. > > This is now based on the new vfio-pci-core framework proposal[1]. > Understand that the framework proposal is still under discussion, > but really appreciate any feedback on the approach taken here > to mitigate the security risks. Do you want to try to get this into this kernel along with the mlx5 driver? How RFCy is this? If you can post a v2 with the changes discussed soon I think there is a good chance of this. Thanks, Jason
On 2/15/22 16:21, Jason Gunthorpe wrote: > On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote: >> On 2/14/22 14:06, Jason Gunthorpe wrote: >>> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote: >>> >>>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB >>>> and then write the new valid entry. Not sure I understood correctly that this >>>> is the 'break-before-make' thingie. >>> >>> Doesn't that explode if the invalid entry is DMA'd to? >>> >> Yes, IIUC. Also, the manual has this note: > > Heh, sounds like "this doesn't work" to me :) > Yeah, but I remember reading in manual that HTTUD (what ARM calls it for dirty tracking, albeit DBM is another term for the same thing) requires FEAT_BBM which avoids us to play the above games. So, supposedly, we can "just" use atomics with IOPTE changes and IOTLB flush. Not if we need the latter flush before or after on smmuv3. >>> Like I said, I'd prefer we not build more on the VFIO type 1 code >>> until we have a conclusion for iommufd.. >>> >> >> I didn't quite understand what you mean by conclusion. > > If people are dead-set against doing iommufd, then lets abandon the > idea and go back to hacking up vfio. > Heh, I was under the impression everybody was investing so much *because* that direction was set onto iommufd direction. >> If by conclusion you mean the whole thing to be merged, how can the work be >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant >> in terms of direction... > > I think go ahead and build it on top of iommufd, start working out the > API details, etc. I think once the direction is concluded the new APIs > will go forward. > /me nods, will do. Looking at your repository it is looking good. >>> While returning the dirty data looks straight forward, it is hard to >>> see an obvious path to enabling and controlling the system iommu the >>> way vfio is now. >> >> It seems strange to have a whole UAPI for userspace [*] meant to >> return dirty data to userspace, when dirty right now means the whole >> pinned page set and so copying the whole guest ... > > Yes, the whole thing is only partially implemented, and doesn't have > any in-kernel user. It is another place holder for an implementation > to come someday. > Yeap, seems like. >> Hence my thinking was that the patches /if small/ would let us see how dirty >> tracking might work for iommu kAPI (and iommufd) too. > > It could be tried, but I think if you go into there you will find it > quickly turns quite complicated to address all the edge cases. Eg what > do you do if you have a mdev present after you turn on system > tracking? What if the mdev is using a PASID? > What about hotplug of new > VFIO devices? > > Remember, dirty tracking for vfio is totally useless without also > having vfio device migration. Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless if we can't do the device part first. But if quiescing DMA and saving state are two hard requirements that are mandatory for a live migrateable VF, having dirty tracking in the devices I suspect might be more rare. So perhaps people will look at IOMMUs as a commodity-workaround to avoid a whole bunch of hardware logic for dirty tracking, even bearing what it entails for DMA performance (hisilicon might be an example). > Do you already have a migration capable > device to use with this? > Not yet, but soon I hope. >> Would it be better to do more iterative steps (when possible) as opposed to >> scratch and rebuild VFIO type1 IOMMU handling? > > Possibly, but every thing that gets added has to be carried over to > the new code too, and energy has to be expended trying to figure out > how the half implemented stuff should work while finishing it. > /me nods I understand > At the very least we must decide what to do with device-provided dirty > tracking before the VFIO type1 stuff can be altered to use the system > IOMMU. > I, too, have been wondering what that is going to look like -- and how do we convey the setup of dirty tracking versus the steering of it. > This is very much like the migration FSM, the only appeal is the > existing qemu implementation of the protocol. Yeah.
On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: > > If people are dead-set against doing iommufd, then lets abandon the > > idea and go back to hacking up vfio. > > > Heh, I was under the impression everybody was investing so much *because* > that direction was set onto iommufd direction. Such is the hope :) > >> If by conclusion you mean the whole thing to be merged, how can the work be > >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant > >> in terms of direction... > > > > I think go ahead and build it on top of iommufd, start working out the > > API details, etc. I think once the direction is concluded the new APIs > > will go forward. > > > /me nods, will do. Looking at your repository it is looking good. I would like to come with some plan for dirty tracking on iommufd and combine that with a plan for dirty tracking inside the new migration drivers. > Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless > if we can't do the device part first. But if quiescing DMA and saving > state are two hard requirements that are mandatory for a live migrateable > VF, having dirty tracking in the devices I suspect might be more > rare. So far all but one of the live migration devices I know about can do dirty tracking internally.. > So perhaps people will look at IOMMUs as a commodity-workaround to avoid > a whole bunch of hardware logic for dirty tracking, even bearing what it > entails for DMA performance (hisilicon might be an example). I do expect this will be true. > > At the very least we must decide what to do with device-provided dirty > > tracking before the VFIO type1 stuff can be altered to use the system > > IOMMU. > > I, too, have been wondering what that is going to look like -- and how do we > convey the setup of dirty tracking versus the steering of it. What I suggested was to just split them. Some ioctl toward IOMMUFD will turn on the system iommu tracker - this would be on a per-domain basis, not on the ioas. Some ioctl toward the vfio device will turn on the device's tracker. Userspace has to iterate and query each enabled tracker. This is not so bad because the time to make the system call is going to be tiny compared to the time to marshal XXGB of dirty bits. This makes qemu more complicated because it has to decide what trackers to turn on, but that is also the point because we do want userspace to be able to decide. The other idea that has some possible interest is to allow the trackers to dump their dirty bits into the existing kvm tracker, then userspace just does a single kvm centric dirty pass. Jason
On 2/23/22 01:03, Jason Gunthorpe wrote: > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>> If by conclusion you mean the whole thing to be merged, how can the work be >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant >>>> in terms of direction... >>> >>> I think go ahead and build it on top of iommufd, start working out the >>> API details, etc. I think once the direction is concluded the new APIs >>> will go forward. >>> >> /me nods, will do. Looking at your repository it is looking good. > > I would like to come with some plan for dirty tracking on iommufd and > combine that with a plan for dirty tracking inside the new migration > drivers. > I had a few things going on my end over the past weeks, albeit it is getting a bit better now and I will be coming back to this topic. I hope/want to give you a more concrete update/feedback over the coming week or two wrt to dirty-tracking+iommufd+amd. So far, I am not particularly concerned that this will affect overall iommufd design. The main thing is really lookups to get vendor iopte, upon on what might be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling the tracking, I'll be simplifying the interface in the other type1 series I had and making it a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can enable/disable over a domain. Perhaps same trick could be expanded to other features to have a bit more control on what userspace is allowed to use. I think this just needs to set/clear a feature bit in the domain, for VFIO or userspace to have full control during the different stages of migration of dirty tracking. In all of the IOMMU implementations/manuals I read it means setting a protection domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent (albeit past work had also it always-on). Provided the iommufd does /separately/ more finer granularity on what we can do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size at will as separate operations, before and after migration respectivally. That logic would probably be better to be in separate iommufd ioctls(), as that it's going to be expensive. >> I, too, have been wondering what that is going to look like -- and how do we >> convey the setup of dirty tracking versus the steering of it. > > What I suggested was to just split them. > > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this > would be on a per-domain basis, not on the ioas. > > Some ioctl toward the vfio device will turn on the device's tracker. > In the activation/fetching-data of either trackers I see some things in common in terms of UAPI with the difference that whether a device or a list of devices are passed on as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading your suggestion) Albeit perhaps the main difference is going to be that one needs to setup with hardware interface with the device tracker and how we carry the regions of memory that want to be tracked i.e. GPA/IOVA ranges that the device should track. The tracking-GPA space is not linear GPA space sadly. But at the same point perhaps the internal VFIO API between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped. In IOMMU this is sort of cheap and 'stateless', but on the setup of the device tracker might mean giving all the IOVA ranges to the PF (once?). Perhaps leaving to the vendor driver to pick when to register the IOVA space to be tracked, or perhaps when you turn on the device's tracker. But on all cases, the driver needs some form of awareness of and convey that to the PF for tracking purposes. > Userspace has to iterate and query each enabled tracker. This is not > so bad because the time to make the system call is going to be tiny > compared to the time to marshal XXGB of dirty bits. > Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap, following by a smaller cost copying back to userspace (which KVM does too, when it's using a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy (gup as you mentioned earlier in the thread), or just copying based on the input bitmap (from PF) number of leading zeroes within some threshold. > This makes qemu more complicated because it has to decide what > trackers to turn on, but that is also the point because we do want > userspace to be able to decide. > If the interface wants extending to pass a device or an array of devices (if I understood you correctly), it would free/simplify VFIO from having to concatenate potentially different devices bitmaps into one. Albeit would require optimizing the marshalling a bit more to avoid performing too much copying. > The other idea that has some possible interest is to allow the > trackers to dump their dirty bits into the existing kvm tracker, then > userspace just does a single kvm centric dirty pass. That would probably limit certain more modern options of ring based dirty tracking, as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least, would require KVM to always use a bitmap during migration/dirty-rate-estimation with the presence of vfio/vdpa devices. It's a nice idea, though. It would require making core-iommu aware of bitmap as external storage for tracking (that is not iommufd as it's a module). Although, perhaps IOMMU sharing pgtables with CPUs would probably better align with reusing KVM existing dirty bitmap interface. But I haven't given much thought on that one. I just remember reading something on the IOMMU manual about this (ISTR that iommu_v2 was made for that purpose).
On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: > On 2/23/22 01:03, Jason Gunthorpe wrote: > > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: > >>>> If by conclusion you mean the whole thing to be merged, how can the work be > >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant > >>>> in terms of direction... > >>> > >>> I think go ahead and build it on top of iommufd, start working out the > >>> API details, etc. I think once the direction is concluded the new APIs > >>> will go forward. > >>> > >> /me nods, will do. Looking at your repository it is looking good. > > > > I would like to come with some plan for dirty tracking on iommufd and > > combine that with a plan for dirty tracking inside the new migration > > drivers. > > > I had a few things going on my end over the past weeks, albeit it is > getting a bit better now and I will be coming back to this topic. I hope/want > to give you a more concrete update/feedback over the coming week or two wrt > to dirty-tracking+iommufd+amd. > > So far, I am not particularly concerned that this will affect overall iommufd > design. The main thing is really lookups to get vendor iopte, upon on what might > be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling > the tracking, I'm not very keen on these multiplexer interfaces. I think you should just add a new ops to the new iommu_domain_ops 'set_dirty_tracking' 'read_dirty_bits' NULL op means not supported. IMHO we don't need a kapi wrapper if only iommufd is going to call the op. > I'll be simplifying the interface in the other type1 series I had and making it > a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can > enable/disable over a domain. Perhaps same trick could be expanded to other > features to have a bit more control on what userspace is allowed to use. I think > this just needs to set/clear a feature bit in the domain, for VFIO or userspace > to have full control during the different stages of migration of dirty tracking. > In all of the IOMMU implementations/manuals I read it means setting a protection > domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on > the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent > (albeit past work had also it always-on). > > Provided the iommufd does /separately/ more finer granularity on what we can > do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size > at will as separate operations, before and after migration respectivally. That logic > would probably be better to be in separate iommufd ioctls(), as that it's going to be > expensive. This all sounds right to me Questions I have: - Do we need ranges for some reason? You mentioned ARM SMMU wants ranges? how/what/why? - What about the unmap and read dirty without races operation that vfio has? > >> I, too, have been wondering what that is going to look like -- and how do we > >> convey the setup of dirty tracking versus the steering of it. > > > > What I suggested was to just split them. > > > > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this > > would be on a per-domain basis, not on the ioas. > > > > Some ioctl toward the vfio device will turn on the device's tracker. > > > In the activation/fetching-data of either trackers I see some things in common in > terms of UAPI with the difference that whether a device or a list of devices are passed on > as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading > your suggestion) I was thinking a VFIO_DEVICE ioctl located on the device FD implemented in the end VFIO driver (like mlx5_vfio). No lists.. As you say the driver should just take in the request to set dirty tracking and take core of it somehow. There is no value the core VFIO code can add here. > Albeit perhaps the main difference is going to be that one needs to > setup with hardware interface with the device tracker and how we > carry the regions of memory that want to be tracked i.e. GPA/IOVA > ranges that the device should track. The tracking-GPA space is not > linear GPA space sadly. But at the same point perhaps the internal > VFIO API between core-VFIO and vendor-VFIO is just reading the @dma > ranges we mapped. Yes, this is a point that needs some answering. One option is to pass in the tracking range list from userspace. Another is to query it in the driver from the currently mapped areas in IOAS. I know devices have limitations here in terms of how many/how big the ranges can be, and devices probably can't track dynamic changes. > In IOMMU this is sort of cheap and 'stateless', but on the setup of the > device tracker might mean giving all the IOVA ranges to the PF (once?). > Perhaps leaving to the vendor driver to pick when to register the IOVA space to > be tracked, or perhaps when you turn on the device's tracker. But on all cases, > the driver needs some form of awareness of and convey that to the PF for > tracking purposes. Yes, this is right > Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap, > following by a smaller cost copying back to userspace (which KVM does too, when it's using > a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy > (gup as you mentioned earlier in the thread), or just copying based on the input bitmap > (from PF) number of leading zeroes within some threshold. What I would probably strive for is an API that deliberately OR's in the dirty bits. So GUP and kmap a 4k page then call the driver to 'or in your dirty data', then do the next page. etc. That is 134M of IOVA per chunk which seems OK. > > This makes qemu more complicated because it has to decide what > > trackers to turn on, but that is also the point because we do want > > userspace to be able to decide. > > > If the interface wants extending to pass a device or an array of devices (if I understood > you correctly), it would free/simplify VFIO from having to concatenate potentially > different devices bitmaps into one. Albeit would require optimizing the marshalling a bit > more to avoid performing too much copying. Yes. Currently VFIO maintains its own bitmap so it also saves that memory by keeping the dirty bits stored in the IOPTEs until read out. > > The other idea that has some possible interest is to allow the > > trackers to dump their dirty bits into the existing kvm tracker, then > > userspace just does a single kvm centric dirty pass. > > That would probably limit certain more modern options of ring based dirty tracking, > as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least, > would require KVM to always use a bitmap during migration/dirty-rate-estimation with > the presence of vfio/vdpa devices. It's a nice idea, though. It would require making > core-iommu aware of bitmap as external storage for tracking (that is not iommufd as > it's a module). Yes, I don't know enough about kvm to say if that is a great idea or not. The fact the CPUs seem to be going to logging instead of bitmaps suggests it isn't. I don't think DMA devices can work effectively with logging.. Jason
On 2/25/22 20:44, Jason Gunthorpe wrote: > On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >> On 2/23/22 01:03, Jason Gunthorpe wrote: >>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>>>> If by conclusion you mean the whole thing to be merged, how can the work be >>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant >>>>>> in terms of direction... >>>>> >>>>> I think go ahead and build it on top of iommufd, start working out the >>>>> API details, etc. I think once the direction is concluded the new APIs >>>>> will go forward. >>>>> >>>> /me nods, will do. Looking at your repository it is looking good. >>> >>> I would like to come with some plan for dirty tracking on iommufd and >>> combine that with a plan for dirty tracking inside the new migration >>> drivers. >>> >> I had a few things going on my end over the past weeks, albeit it is >> getting a bit better now and I will be coming back to this topic. I hope/want >> to give you a more concrete update/feedback over the coming week or two wrt >> to dirty-tracking+iommufd+amd. >> >> So far, I am not particularly concerned that this will affect overall iommufd >> design. The main thing is really lookups to get vendor iopte, upon on what might >> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling >> the tracking, > > I'm not very keen on these multiplexer interfaces. I think you should > just add a new ops to the new iommu_domain_ops 'set_dirty_tracking' > 'read_dirty_bits' > > NULL op means not supported. > > IMHO we don't need a kapi wrapper if only iommufd is going to call the > op. > OK, good point. Even without a kapi wrapper I am still wondering whether the iommu op needs to be something like a generic iommu feature toggling (e.g. .set_feature()), rather than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about the only feature we will change out-of-band in the protection-domain. I guess I can stay with set_ad_tracking/set_dirty_tracking and if should need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS). Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync' because the sync() doesn't only read, but also "writes" to root pagetable to update the dirty bit (and then IOTLB flush). And that's about what VFIO current interface does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear. And TBH, the question on whether we need a clear op isn't immediately obvious: reading the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what it wants to clear (in principle cheaper on io-page-table walk as you just iterate over sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of get_dirty_bitmap() and clear_dirty_ditmap(). Hopefully a futuristic IOMMUs would just get a new DTE/CD/etc field for stashing a set of PFNs (for a bitmap) that the IOMMU would use for setting the dirty bits there. That is, rather than forcing sw to split/merge pagetables for better granularity and having to flush IOTLBs for dirty to be written again :( >> I'll be simplifying the interface in the other type1 series I had and making it >> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can >> enable/disable over a domain. Perhaps same trick could be expanded to other >> features to have a bit more control on what userspace is allowed to use. I think >> this just needs to set/clear a feature bit in the domain, for VFIO or userspace >> to have full control during the different stages of migration of dirty tracking. >> In all of the IOMMU implementations/manuals I read it means setting a protection >> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on >> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent >> (albeit past work had also it always-on). >> >> Provided the iommufd does /separately/ more finer granularity on what we can >> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size >> at will as separate operations, before and after migration respectivally. That logic >> would probably be better to be in separate iommufd ioctls(), as that it's going to be >> expensive. > > This all sounds right to me > > Questions I have: > - Do we need ranges for some reason? You mentioned ARM SMMU wants > ranges? how/what/why? > Ignore that. I got mislead by the implementation and when I read the SDM I realized that the implementation was doing the same thing I was doing i.e. enabling dirty-bit in the protection domain at start rather than dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD bit in the context descriptor to enable dirty bits or the STE.S2HD in the stream table entry for the stage2 equivalent. Nothing here is per-range basis. And the ranges was only used by the implementation for the eager splitting/merging of IO page table levels. > - What about the unmap and read dirty without races operation that > vfio has? > I am afraid that might need a new unmap iommu op that reads the dirty bit after clearing the page table entry. It's marshalling the bits from iopte into a bitmap as opposed to some logic added on top. As far as I looked for AMD this isn't difficult to add, (same for Intel) it can *I think* reuse all of the unmap code. >>>> I, too, have been wondering what that is going to look like -- and how do we >>>> convey the setup of dirty tracking versus the steering of it. >>> >>> What I suggested was to just split them. >>> >>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this >>> would be on a per-domain basis, not on the ioas. >>> >>> Some ioctl toward the vfio device will turn on the device's tracker. >>> >> In the activation/fetching-data of either trackers I see some things in common in >> terms of UAPI with the difference that whether a device or a list of devices are passed on >> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading >> your suggestion) > > I was thinking a VFIO_DEVICE ioctl located on the device FD > implemented in the end VFIO driver (like mlx5_vfio). No lists.. > Interesting. I was assuming that we wanted to keep the same ioctl() interface for dirty-tracking hence me mentioning the device/device-list on top of the DIRTY ioctl. But well on a second thought it doesn't make sense given that we query a container and we want to move away from vfio for iopgtbl related-work and rather into iommufd direct access instead by the VMM. So placing more dependency on that ioctl wouldn't align with that. I suppose, it makes sense that this is on a per-vfio-device basis, hence device-vfio ioctl(). > As you say the driver should just take in the request to set dirty > tracking and take core of it somehow. There is no value the core VFIO > code can add here. > >> Albeit perhaps the main difference is going to be that one needs to >> setup with hardware interface with the device tracker and how we >> carry the regions of memory that want to be tracked i.e. GPA/IOVA >> ranges that the device should track. The tracking-GPA space is not >> linear GPA space sadly. But at the same point perhaps the internal >> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma >> ranges we mapped. > > Yes, this is a point that needs some answering. One option is to pass > in the tracking range list from userspace. Another is to query it in > the driver from the currently mapped areas in IOAS. > I sort of like the second given that we de-duplicate info that is already tracked by IOAS -- it would be mostly internal and then it would be a matter of when does this device tracker is set up, and whether we should differentiate tracker "start"/"stop" vs "setup"/"teardown". > I know devices have limitations here in terms of how many/how big the > ranges can be, and devices probably can't track dynamic changes. > /me nods Should this be some sort of capability perhaps? Given that this may limit how many concurrent VFs can be migrated and how much ranges it can store, for example. (interestingly and speaking of VF capabilities, it's yet another thing to tackle in the migration stream between src and dst hosts) >> In IOMMU this is sort of cheap and 'stateless', but on the setup of the >> device tracker might mean giving all the IOVA ranges to the PF (once?). >> Perhaps leaving to the vendor driver to pick when to register the IOVA space to >> be tracked, or perhaps when you turn on the device's tracker. But on all cases, >> the driver needs some form of awareness of and convey that to the PF for >> tracking purposes. > > Yes, this is right > >> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap, >> following by a smaller cost copying back to userspace (which KVM does too, when it's using >> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy >> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap >> (from PF) number of leading zeroes within some threshold. > > What I would probably strive for is an API that deliberately OR's in > the dirty bits. So GUP and kmap a 4k page then call the driver to 'or > in your dirty data', then do the next page. etc. That is 134M of IOVA > per chunk which seems OK. > Yeap, sort of what I was aiming for. >>> This makes qemu more complicated because it has to decide what >>> trackers to turn on, but that is also the point because we do want >>> userspace to be able to decide. >>> >> If the interface wants extending to pass a device or an array of devices (if I understood >> you correctly), it would free/simplify VFIO from having to concatenate potentially >> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit >> more to avoid performing too much copying. > > Yes. Currently VFIO maintains its own bitmap so it also saves that > memory by keeping the dirty bits stored in the IOPTEs until read out. > /me nods with the iommu, yes.
On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote: > On 2/25/22 20:44, Jason Gunthorpe wrote: > > On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: > >> On 2/23/22 01:03, Jason Gunthorpe wrote: > >>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: > >>>>>> If by conclusion you mean the whole thing to be merged, how can the work be > >>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant > >>>>>> in terms of direction... > >>>>> > >>>>> I think go ahead and build it on top of iommufd, start working out the > >>>>> API details, etc. I think once the direction is concluded the new APIs > >>>>> will go forward. > >>>>> > >>>> /me nods, will do. Looking at your repository it is looking good. > >>> > >>> I would like to come with some plan for dirty tracking on iommufd and > >>> combine that with a plan for dirty tracking inside the new migration > >>> drivers. > >>> > >> I had a few things going on my end over the past weeks, albeit it is > >> getting a bit better now and I will be coming back to this topic. I hope/want > >> to give you a more concrete update/feedback over the coming week or two wrt > >> to dirty-tracking+iommufd+amd. > >> > >> So far, I am not particularly concerned that this will affect overall iommufd > >> design. The main thing is really lookups to get vendor iopte, upon on what might > >> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling > >> the tracking, > > > > I'm not very keen on these multiplexer interfaces. I think you should > > just add a new ops to the new iommu_domain_ops 'set_dirty_tracking' > > 'read_dirty_bits' > > > > NULL op means not supported. > > > > IMHO we don't need a kapi wrapper if only iommufd is going to call the > > op. > > > > OK, good point. > > Even without a kapi wrapper I am still wondering whether the iommu op needs to > be something like a generic iommu feature toggling (e.g. .set_feature()), rather > than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about > the only feature we will change out-of-band in the protection-domain. > I guess I can stay with set_ad_tracking/set_dirty_tracking and if should > need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS). I just generally dislike multiplexers like this. We are already calling through a function pointer struct, why should the driver implement another switch/case just to find out which function pointer the caller really ment to use? It doesn't make things faster, it doesn't make things smaller, it doesn't use less LOC. Why do it? > Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync' > because the sync() doesn't only read, but also "writes" to root pagetable to update > the dirty bit (and then IOTLB flush). And that's about what VFIO current interface > does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear. 'read and clear' is what I'd suggest > And TBH, the question on whether we need a clear op isn't immediately obvious: reading > the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive > io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what > it wants to clear (in principle cheaper on io-page-table walk as you just iterate over > sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the > IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this > isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my > thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of > get_dirty_bitmap() and clear_dirty_ditmap(). Yes, I wouldn't split them. > > Questions I have: > > - Do we need ranges for some reason? You mentioned ARM SMMU wants > > ranges? how/what/why? > > > Ignore that. I got mislead by the implementation and when I read the SDM > I realized that the implementation was doing the same thing I was > doing Ok > > - What about the unmap and read dirty without races operation that > > vfio has? > > > I am afraid that might need a new unmap iommu op that reads the dirty bit > after clearing the page table entry. It's marshalling the bits from > iopte into a bitmap as opposed to some logic added on top. As far as I > looked for AMD this isn't difficult to add, (same for Intel) it can > *I think* reuse all of the unmap code. Ok. It feels necessary to be complete > > Yes, this is a point that needs some answering. One option is to pass > > in the tracking range list from userspace. Another is to query it in > > the driver from the currently mapped areas in IOAS. > > > I sort of like the second given that we de-duplicate info that is already > tracked by IOAS -- it would be mostly internal and then it would be a > matter of when does this device tracker is set up, and whether we should > differentiate tracker "start"/"stop" vs "setup"/"teardown". One problem with this is that devices that don't support dynamic tracking are stuck in vIOMMU cases where the IOAS will have some tiny set of all memory mapped. So I'd probably say qemu should forward the entire guest CPU memory space, as well as any future memory hotplug area, as ranges and not rely on the IOAS already mapping anything. > > I know devices have limitations here in terms of how many/how big the > > ranges can be, and devices probably can't track dynamic changes. > > > /me nods > > Should this be some sort of capability perhaps? Given that this may limit > how many concurrent VFs can be migrated and how much ranges it can store, > for example. > > (interestingly and speaking of VF capabilities, it's yet another thing to > tackle in the migration stream between src and dst hosts) I don't know what to do with these limitations right now.. Jason
On 2/28/22 21:01, Jason Gunthorpe wrote: > On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote: >> On 2/25/22 20:44, Jason Gunthorpe wrote: >>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >>>> On 2/23/22 01:03, Jason Gunthorpe wrote: >>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be >>>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant >>>>>>>> in terms of direction... >>>>>>> >>>>>>> I think go ahead and build it on top of iommufd, start working out the >>>>>>> API details, etc. I think once the direction is concluded the new APIs >>>>>>> will go forward. >>>>>>> >>>>>> /me nods, will do. Looking at your repository it is looking good. >>>>> >>>>> I would like to come with some plan for dirty tracking on iommufd and >>>>> combine that with a plan for dirty tracking inside the new migration >>>>> drivers. >>>>> >>>> I had a few things going on my end over the past weeks, albeit it is >>>> getting a bit better now and I will be coming back to this topic. I hope/want >>>> to give you a more concrete update/feedback over the coming week or two wrt >>>> to dirty-tracking+iommufd+amd. >>>> >>>> So far, I am not particularly concerned that this will affect overall iommufd >>>> design. The main thing is really lookups to get vendor iopte, upon on what might >>>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling >>>> the tracking, >>> >>> I'm not very keen on these multiplexer interfaces. I think you should >>> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking' >>> 'read_dirty_bits' >>> >>> NULL op means not supported. >>> >>> IMHO we don't need a kapi wrapper if only iommufd is going to call the >>> op. >>> >> >> OK, good point. >> >> Even without a kapi wrapper I am still wondering whether the iommu op needs to >> be something like a generic iommu feature toggling (e.g. .set_feature()), rather >> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about >> the only feature we will change out-of-band in the protection-domain. >> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should >> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS). > > I just generally dislike multiplexers like this. We are already > calling through a function pointer struct, why should the driver > implement another switch/case just to find out which function pointer > the caller really ment to use? It doesn't make things faster, it > doesn't make things smaller, it doesn't use less LOC. Why do it? > I concur with you in the above but I don't mean it like a multiplexer. Rather, mimicking the general nature of feature bits in the protection domain (or the hw pagetable abstraction). So hypothetically for every bit ... if you wanted to create yet another op that just flips a bit of the DTEs/PASID-table/CD it would be an excessive use of callbacks to get to in the iommu_domain_ops if they all set do the same thing. Right now it's only Dirty (Access bit I don't see immediate use for it right now) bits, but could there be more? Perhaps this is something to think about. Anyways, I am anyways sticking with plain ops function. >> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync' >> because the sync() doesn't only read, but also "writes" to root pagetable to update >> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface >> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear. > > 'read and clear' is what I'd suggest > /me nods >>> - What about the unmap and read dirty without races operation that >>> vfio has? >>> >> I am afraid that might need a new unmap iommu op that reads the dirty bit >> after clearing the page table entry. It's marshalling the bits from >> iopte into a bitmap as opposed to some logic added on top. As far as I >> looked for AMD this isn't difficult to add, (same for Intel) it can >> *I think* reuse all of the unmap code. > > Ok. It feels necessary to be complete > Yes, I guess it's mandatory if we want to fully implement vfio-compat dirty tracking IOCTLs, too. >>> Yes, this is a point that needs some answering. One option is to pass >>> in the tracking range list from userspace. Another is to query it in >>> the driver from the currently mapped areas in IOAS. >>> >> I sort of like the second given that we de-duplicate info that is already >> tracked by IOAS -- it would be mostly internal and then it would be a >> matter of when does this device tracker is set up, and whether we should >> differentiate tracker "start"/"stop" vs "setup"/"teardown". > > One problem with this is that devices that don't support dynamic > tracking are stuck in vIOMMU cases where the IOAS will have some tiny > set of all memory mapped. > Sorry to be pedantic, when you say 'dynamic tracking' for you it just means that you have no limitation of ranges and fw/hw can cope with being fed of 'new-ranges' to enable dirty-tracking. Or does it mean something else at hw level? Like dirty bitmap being pulled out of device memory, and hw keeps its own state of dirtying and sw 'just' needs to sync the bitmap every scan.
On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote: > I concur with you in the above but I don't mean it like a multiplexer. > Rather, mimicking the general nature of feature bits in the protection domain > (or the hw pagetable abstraction). So hypothetically for every bit ... if you > wanted to create yet another op that just flips a bit of the > DTEs/PASID-table/CD it would be an excessive use of callbacks to get > to in the iommu_domain_ops if they all set do the same thing. > Right now it's only Dirty (Access bit I don't see immediate use for it > right now) bits, but could there be more? Perhaps this is something to > think about. Not sure it matters :) > > One problem with this is that devices that don't support dynamic > > tracking are stuck in vIOMMU cases where the IOAS will have some tiny > > set of all memory mapped. > > > Sorry to be pedantic, when you say 'dynamic tracking' for you it just means > that you have no limitation of ranges and fw/hw can cope with being fed of > 'new-ranges' to enable dirty-tracking. Yes, the ranges can change once the tracking starts, like the normal IOMMU can do We are looking at devices where the HW can track a range at the start of tracking and that range cannot change over time. So, we need to track the whole guest memory and some extra for memory hot plug, not just the currently viommu mapped things. Jason
On 3/1/22 13:54, Jason Gunthorpe wrote: > On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote: > >> I concur with you in the above but I don't mean it like a multiplexer. >> Rather, mimicking the general nature of feature bits in the protection domain >> (or the hw pagetable abstraction). So hypothetically for every bit ... if you >> wanted to create yet another op that just flips a bit of the >> DTEs/PASID-table/CD it would be an excessive use of callbacks to get >> to in the iommu_domain_ops if they all set do the same thing. >> Right now it's only Dirty (Access bit I don't see immediate use for it >> right now) bits, but could there be more? Perhaps this is something to >> think about. > > Not sure it matters :) > Hehe, most likely it doesn't :) >>> One problem with this is that devices that don't support dynamic >>> tracking are stuck in vIOMMU cases where the IOAS will have some tiny >>> set of all memory mapped. >>> >> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means >> that you have no limitation of ranges and fw/hw can cope with being fed of >> 'new-ranges' to enable dirty-tracking. > > Yes, the ranges can change once the tracking starts, like the normal > IOMMU can do > > We are looking at devices where the HW can track a range at the start > of tracking and that range cannot change over time. > > So, we need to track the whole guest memory and some extra for memory > hot plug, not just the currently viommu mapped things. I'm aware of the guest memmap accounting foo, so I was already factoring that in the equation. I was just mainly grokking at hardware limitations you are coming from to be captured in these future VFIO extensions. Tracking the max possible GPA (as you hinted earlier) could solve the two natures of write-tracking, also given that the IOMMU ultimately protects what's really allowed to be DMA-written to. Unless such tracking-bitmap is placed on device memory, where space probably needs to be more carefully managed. Anyways, thanks for the insights!
On 2/28/22 13:01, Joao Martins wrote: > On 2/25/22 20:44, Jason Gunthorpe wrote: >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >>> On 2/23/22 01:03, Jason Gunthorpe wrote: >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>> I'll be simplifying the interface in the other type1 series I had and making it >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can >>> enable/disable over a domain. Perhaps same trick could be expanded to other >>> features to have a bit more control on what userspace is allowed to use. I think >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace >>> to have full control during the different stages of migration of dirty tracking. >>> In all of the IOMMU implementations/manuals I read it means setting a protection >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent >>> (albeit past work had also it always-on). >>> >>> Provided the iommufd does /separately/ more finer granularity on what we can >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size >>> at will as separate operations, before and after migration respectivally. That logic >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be >>> expensive. >> >> This all sounds right to me >> >> Questions I have: >> - Do we need ranges for some reason? You mentioned ARM SMMU wants >> ranges? how/what/why? >> > Ignore that. I got mislead by the implementation and when I read the SDM > I realized that the implementation was doing the same thing I was doing > i.e. enabling dirty-bit in the protection domain at start rather than > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD > bit in the context descriptor to enable dirty bits or the STE.S2HD in the > stream table entry for the stage2 equivalent. Nothing here is per-range > basis. And the ranges was only used by the implementation for the eager > splitting/merging of IO page table levels. > >> - What about the unmap and read dirty without races operation that >> vfio has? >> > I am afraid that might need a new unmap iommu op that reads the dirty bit > after clearing the page table entry. It's marshalling the bits from > iopte into a bitmap as opposed to some logic added on top. As far as I > looked for AMD this isn't difficult to add, (same for Intel) it can > *I think* reuse all of the unmap code. > OK, made some progress. It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests, revising locking, bugs, and more -- works with my emulated qemu patches which is a good sign. But hopefully starts some sort of skeleton of what we were talking about in the above thread. The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with the vfio-compat one as it was easier provided there's existing userspace to work with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto steering the dirty tracking, read-clear the dirty bits, unmap and get dirty. But as we discussed earlier, the one gap of VFIO dirty API is that it lacks controls for upgrading/downgrading area/IOPTE sizes which is where IOMMUFD would most likely shine. When that latter part is done we can probably adopt an eager-split approach inside vfio-compat. Additionally I also sort of want to skeleton ARM and Intel to see how it looks. Some of the commits made notes of some of research I made, so *I think* the APIs introduced capture h/w semantics for all the three IOMMUs supporting dirty tracking. I am storing my dirty-tracking bits here: https://github.com/jpemartins/linux iommufd It's a version of it rebased on top of your iommufd branch.
On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote: > On 2/28/22 13:01, Joao Martins wrote: > > On 2/25/22 20:44, Jason Gunthorpe wrote: > >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: > >>> On 2/23/22 01:03, Jason Gunthorpe wrote: > >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: > >>> I'll be simplifying the interface in the other type1 series I had and making it > >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can > >>> enable/disable over a domain. Perhaps same trick could be expanded to other > >>> features to have a bit more control on what userspace is allowed to use. I think > >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace > >>> to have full control during the different stages of migration of dirty tracking. > >>> In all of the IOMMU implementations/manuals I read it means setting a protection > >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on > >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent > >>> (albeit past work had also it always-on). > >>> > >>> Provided the iommufd does /separately/ more finer granularity on what we can > >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size > >>> at will as separate operations, before and after migration respectivally. That logic > >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be > >>> expensive. > >> > >> This all sounds right to me > >> > >> Questions I have: > >> - Do we need ranges for some reason? You mentioned ARM SMMU wants > >> ranges? how/what/why? > >> > > Ignore that. I got mislead by the implementation and when I read the SDM > > I realized that the implementation was doing the same thing I was doing > > i.e. enabling dirty-bit in the protection domain at start rather than > > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD > > bit in the context descriptor to enable dirty bits or the STE.S2HD in the > > stream table entry for the stage2 equivalent. Nothing here is per-range > > basis. And the ranges was only used by the implementation for the eager > > splitting/merging of IO page table levels. > > > >> - What about the unmap and read dirty without races operation that > >> vfio has? > >> > > I am afraid that might need a new unmap iommu op that reads the dirty bit > > after clearing the page table entry. It's marshalling the bits from > > iopte into a bitmap as opposed to some logic added on top. As far as I > > looked for AMD this isn't difficult to add, (same for Intel) it can > > *I think* reuse all of the unmap code. > > > > OK, made some progress. Nice! > It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests, > revising locking, bugs, and more -- works with my emulated qemu patches which > is a good sign. But hopefully starts some sort of skeleton of what we were > talking about in the above thread. I'm a bit bogged with the coming merge window right now, but will have more to say in a bit > The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with > the vfio-compat one as it was easier provided there's existing userspace to work > with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto > steering the dirty tracking, read-clear the dirty bits, unmap and > get dirty. I think this is fine to start experimenting with > But as we discussed earlier, the one gap of VFIO dirty API is that > it lacks controls for upgrading/downgrading area/IOPTE sizes which > is where IOMMUFD would most likely shine. When that latter part is > done we can probably adopt an eager-split approach inside > vfio-compat. I think the native API should be new ioctls that operate on the hw_pagetable object to: - enable/disable dirty tracking - read&clear a bitmap from a range - read&unmap a bitmap from a range - Manipulate IOPTE sizes As you say it should not be much distance from the VFIO compat stuff Most probably I would say to leave dirty tracking out of the type1 api and compat for it. Maybe we can make some limited cases work back compat, like the whole ioas supports iommu dirty tracking or something.. Need to understand if it is wortwhile - remember to use any of this you need a qemu that is updated to the v2 migration interface, so there is little practical value in back compat to old qemu if we expect qemu will use the native interface anyhow. > Additionally I also sort of want to skeleton ARM and Intel to see how it looks. > Some of the commits made notes of some of research I made, so *I think* the APIs > introduced capture h/w semantics for all the three IOMMUs supporting dirty > tracking. I think the primitives are pretty basic, I'd be surprised if there is something different :) Though things to be thinking about are how does this work with nested translation and other advanced features. Thanks, Jason
On 3/15/22 19:29, Jason Gunthorpe wrote: > On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote: >> On 2/28/22 13:01, Joao Martins wrote: >>> On 2/25/22 20:44, Jason Gunthorpe wrote: >>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >>>>> On 2/23/22 01:03, Jason Gunthorpe wrote: >>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>>> I'll be simplifying the interface in the other type1 series I had and making it >>>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can >>>>> enable/disable over a domain. Perhaps same trick could be expanded to other >>>>> features to have a bit more control on what userspace is allowed to use. I think >>>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace >>>>> to have full control during the different stages of migration of dirty tracking. >>>>> In all of the IOMMU implementations/manuals I read it means setting a protection >>>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on >>>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent >>>>> (albeit past work had also it always-on). >>>>> >>>>> Provided the iommufd does /separately/ more finer granularity on what we can >>>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size >>>>> at will as separate operations, before and after migration respectivally. That logic >>>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be >>>>> expensive. >>>> >>>> This all sounds right to me >>>> >>>> Questions I have: >>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants >>>> ranges? how/what/why? >>>> >>> Ignore that. I got mislead by the implementation and when I read the SDM >>> I realized that the implementation was doing the same thing I was doing >>> i.e. enabling dirty-bit in the protection domain at start rather than >>> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD >>> bit in the context descriptor to enable dirty bits or the STE.S2HD in the >>> stream table entry for the stage2 equivalent. Nothing here is per-range >>> basis. And the ranges was only used by the implementation for the eager >>> splitting/merging of IO page table levels. >>> >>>> - What about the unmap and read dirty without races operation that >>>> vfio has? >>>> >>> I am afraid that might need a new unmap iommu op that reads the dirty bit >>> after clearing the page table entry. It's marshalling the bits from >>> iopte into a bitmap as opposed to some logic added on top. As far as I >>> looked for AMD this isn't difficult to add, (same for Intel) it can >>> *I think* reuse all of the unmap code. >>> >> >> OK, made some progress. > > Nice! > > >> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests, >> revising locking, bugs, and more -- works with my emulated qemu patches which >> is a good sign. But hopefully starts some sort of skeleton of what we were >> talking about in the above thread. > > I'm a bit bogged with the coming merge window right now, but will have > more to say in a bit > Thanks! Take your time :) >> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with >> the vfio-compat one as it was easier provided there's existing userspace to work >> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto >> steering the dirty tracking, read-clear the dirty bits, unmap and >> get dirty. > > I think this is fine to start experimenting with > >> But as we discussed earlier, the one gap of VFIO dirty API is that >> it lacks controls for upgrading/downgrading area/IOPTE sizes which >> is where IOMMUFD would most likely shine. When that latter part is >> done we can probably adopt an eager-split approach inside >> vfio-compat. > > I think the native API should be new ioctls that operate on the > hw_pagetable object to: > - enable/disable dirty tracking > - read&clear a bitmap from a range > - read&unmap a bitmap from a range > - Manipulate IOPTE sizes > Yeap -- heh nice, it is roughly what I was into already. I will take the opportunity of doing the zerocopy/gup of the bitmap when writing the iommufd native ioctls. The vfio-compat is still copying. other things on my mind for iommufd native interface are: 1) The iopte split *could* be while we read the dirty bits of a pgsize != than the args pgsize. The read_and_clear_dirty() is already expensive, but I wonder about the idea to take advantage of the dirty-clearing needed IOTLB flush, to also also update the IOPTEs on the next memory-request/translation-request. Albeit iopte collpase still needs to be routed via a specific ioctl() to promote to a higher page size (should the migration fail or something). (I referred to this in the past as lazy-split contrast to explicit IOPTE size manipulation iommufd ioctls i.e. {manual,eager}-split) 2) I was thinking io_pagetable new APIs instead of hw_pagetable, but if you're sort of seeing this as a hw_pagetable object primitive, then I am wondering why? Albeit at the end of the day we operate on the iopt inside the hw_pagetable. Note: IIUC that hw_pagetable is supposed to model a iommu_domain direct manipulation which dirty fits in, but wouldn't that be applicable to _MAP and _UNMAP too maybe? > As you say it should not be much distance from the VFIO compat stuff > > Most probably I would say to leave dirty tracking out of the type1 api > and compat for it. Maybe we can make some limited cases work back > compat, like the whole ioas supports iommu dirty tracking or > something.. > > Need to understand if it is wortwhile - remember to use any of this > you need a qemu that is updated to the v2 migration interface, > so there is little practical value in back compat to old qemu if we > expect qemu will use the native interface anyhow. > Perhaps the value is really just *how much* applications need to get rewritten to adjust to iommufd versus iteratively adding small (new) parts of it . migration-v2 didn't turn into a complete rewrite of the vfio initial part iiuc. Albeit I suspect keeping the compat was perhaps a struggle to keep, which we might not what we desire if the semantics diverge too much. e.g. If the semantics are about the same, for example, new VMMs could use the iommufd new features using IOPTE size change ioctls() via the VFIO_IOAS (if available) while leaving vfio-compat somewhat usable on the existent dirty ioctls. While slowly moving to newer iommufd (and less vfio-iommu-type1) until the point you totally deprecate it. Even without IOPTE size modification in vfio-compat you would still mark as dirty a lot less than the current logic would (which really represents the whole bitmap). I guess at the end of the day it is the maintenance cost -- Fortunately APIs are looking similar and most of vfio-compat is dealing with userspace args/validation. Anyhow just some thoughts, open to anything really :) >> Additionally I also sort of want to skeleton ARM and Intel to see how it looks. >> Some of the commits made notes of some of research I made, so *I think* the APIs >> introduced capture h/w semantics for all the three IOMMUs supporting dirty >> tracking. > > I think the primitives are pretty basic, I'd be surprised if there is > something different :) > Yeah, they are very basic. My main unknown in the rework was how the hw protection domain context can deal with dynamically switching dirty on and off. But I think I answered myself that restriction at least from groking into specs and experimentation. > Though things to be thinking about are how does this work with nested > translation and other advanced features /me nods There's usually some flags/bits on the 'stage 2' domain, at least ARM (S2CD) and Intel (second-level table). But for AMD, you usually control via the vIOMMU what features the hw exposes[*], and I think there's an extra bit to set to disable vIOMMU dirty tracking from VMM. But I think still the only thing needed is first-stage tracking, as that's what matters for L0 live migration. [*] AMD vIOMMU also offloads some of the command processing, not just the pagetables. Not sure if ARM/Intel has the equivalent.
On 3/16/22 16:36, Joao Martins wrote: > On 3/15/22 19:29, Jason Gunthorpe wrote: >> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote: >>> On 2/28/22 13:01, Joao Martins wrote: >>>> On 2/25/22 20:44, Jason Gunthorpe wrote: >>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote: >>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>>> Questions I have: >>>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants >>>>> ranges? how/what/why? >>>>> An amend here. Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing the ARM side. Apparently you have two controls: one is a 'feature bit' just like x86 and another is a modifier (arm-only). The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor equivalent for second-level. That's the top-level enabler to actually a *second* modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so called DBM (dirty-bit-modifier). The latter when set, changes the meaning of read/write access-flags of the PTE AP[2]. If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a transition in the SMMU from "writable Clean" to "written" means that the the access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0) if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault like it normally would be with DBM = 0). Same thing for stage-2, except that the access-bits are reversed (S2AP[1] is set when "written" and it's cleared when it's "writable" (when DBM is also set). Now you could say that this allows you to control on a per-range basis. Gah, no, more like a per-PTE basis is more accurate. And in practice I suppose that means that dynamically switching on/off SMMU dirty-tracking *dynamically* means not only setting CD.HD but also walking the page tables, and atomically setting/clearing both the DBM and AP[2]. References: DDI0487H, Table D5-30 Data access permissions SMMU 3.2 spec, 3.13.3 Dirty flag hardware update
On 3/16/22 20:37, Joao Martins wrote: > On 3/16/22 16:36, Joao Martins wrote: >> On 3/15/22 19:29, Jason Gunthorpe wrote: >>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote: >>>> On 2/28/22 13:01, Joao Martins wrote: >>>>> On 2/25/22 20:44, Jason Gunthorpe wrote: >>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote: >>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote: >>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote: >>>>>> Questions I have: >>>>>> - Do we need ranges for some reason? You mentioned ARM SMMU wants >>>>>> ranges? how/what/why? >>>>>> > > An amend here. > > Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing > the ARM side. Apparently you have two controls: one is a 'feature bit' > just like x86 and another is a modifier (arm-only). > > The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor > equivalent for second-level. That's the top-level enabler to actually a *second* > modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so > called DBM (dirty-bit-modifier). The latter when set, changes the meaning of > read/write access-flags of the PTE AP[2]. > > If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a > transition in the SMMU from "writable Clean" to "written" means that the the > access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0) > if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault > like it normally would be with DBM = 0). Same thing for stage-2, except that > the access-bits are reversed (S2AP[1] is set when "written" and it's cleared > when it's "writable" (when DBM is also set). > > Now you could say that this allows you to control on a per-range basis. > Gah, no, more like a per-PTE basis is more accurate. > > And in practice I suppose that means that dynamically switching on/off SMMU > dirty-tracking *dynamically* means not only setting CD.HD but also walking the > page tables, and atomically setting/clearing both the DBM and AP[2]. > > References: > > DDI0487H, Table D5-30 Data access permissions > SMMU 3.2 spec, 3.13.3 Dirty flag hardware update I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based on the past work) and adjusted the 'set tracking' structure to cover this slightly different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API, which is internal in iommufd obviously. The UAPI won't change ofc. It's only compile-tested sadly as I have no SMMUv3.2 hardware, and to have this SMMUv3 DBM/HTTU support there's some requirements on the processor that I am not sure they can be fully emulated. The Intel iommu implementation follows same model as AMD and I will get to that next with a compliant iommu emulation too. But now, I will be focusing on hw_pagetable UAPI part. I understand your thinking of being tied to the hw_pagetable obj as opposed to IOAS for the dirty tracking APIs.
On Fri, Mar 18, 2022 at 05:12:01PM +0000, Joao Martins wrote: > I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based > on the past work) and adjusted the 'set tracking' structure to cover this slightly > different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API, > which is internal in iommufd obviously. The UAPI won't change ofc. Shameerali has HW, AFAIK.. Jason