mbox series

[RFC,v2,0/4] vfio/hisilicon: add acc live migration driver

Message ID 20210702095849.1610-1-shameerali.kolothum.thodi@huawei.com (mailing list archive)
Headers show
Series vfio/hisilicon: add acc live migration driver | expand

Message

Shameerali Kolothum Thodi July 2, 2021, 9:58 a.m. UTC
Hi,

This series attempts to add vfio live migration support for
HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
includes both the functional register space and migration 
control register space. As discussed in RFCv1[0], this may create
security issues as these regions get shared between the Guest
driver and the migration driver. Based on the feedback, we tried
to address those concerns in this version. 

This is now based on the new vfio-pci-core framework proposal[1].
Understand that the framework proposal is still under discussion,
but really appreciate any feedback on the approach taken here
to mitigate the security risks.

The major changes from v1 are,

 -Adds a new vendor-specific vfio_pci driver(hisi-acc-vfio-pci)
  for HiSilicon ACC VF devices based on the new vfio-pci-core
  framework proposal.

 -Since HiSilicon ACC VF device MMIO space contains both the
  functional register space and migration control register space,
  override the vfio_device_ops ioctl method to report only the
  functional space to VMs.

 -For a successful migration, we still need access to VF dev
  functional register space mainly to read the status registers.
  But accessing these while the Guest vCPUs are running may leave
  a security hole. To avoid any potential security issues, we
  map/unmap the MMIO regions on a need basis and is safe to do so.
  (Please see hisi_acc_vf_ioremap/unmap() fns in patch #4).
 
 -Dropped debugfs support for now.
 -Uses common QM functions for mailbox access(patch #3).

This is now sanity tested on HiSilicon platforms that support these
ACC devices.

From v1:
 -In order to ensure the compatibility of the devices before and
  after the migration, the device compatibility information check
  will be performed in the Pre-copy stage. If the check fails,
  an error will be returned and the source VM will exit the migration.
 -After the compatibility check is passed, it will enter the
  Stop-and-copy stage. At this time, all the live migration data
  will be copied and saved to the VF device of the destination.
  Finally, the VF device of the destination will be started.

Thanks,
Shameer
[0] https://lore.kernel.org/lkml/20210415220137.GA1672608@nvidia.com/
[1] https://lore.kernel.org/lkml/20210603160809.15845-1-mgurtovoy@nvidia.com/

Longfang Liu (2):
  crypto: hisilicon/qm - Export mailbox functions for common use
  hisi_acc_vfio_pci: Add support for vfio live migration

Shameer Kolothum (2):
  hisi-acc-vfio-pci: add new vfio_pci driver for HiSilicon ACC devices
  hisi_acc_vfio_pci: Override ioctl method to limit BAR2 region size

 drivers/crypto/hisilicon/qm.c        |    8 +-
 drivers/crypto/hisilicon/qm.h        |    4 +
 drivers/vfio/pci/Kconfig             |   13 +
 drivers/vfio/pci/Makefile            |    2 +
 drivers/vfio/pci/hisi_acc_vfio_pci.c | 1155 ++++++++++++++++++++++++++
 drivers/vfio/pci/hisi_acc_vfio_pci.h |  144 ++++
 6 files changed, 1323 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.c
 create mode 100644 drivers/vfio/pci/hisi_acc_vfio_pci.h

Comments

Jason Gunthorpe Feb. 2, 2022, 1:14 p.m. UTC | #1
On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> This series attempts to add vfio live migration support for
> HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> includes both the functional register space and migration 
> control register space. As discussed in RFCv1[0], this may create
> security issues as these regions get shared between the Guest
> driver and the migration driver. Based on the feedback, we tried
> to address those concerns in this version. 
> 
> This is now based on the new vfio-pci-core framework proposal[1].
> Understand that the framework proposal is still under discussion,
> but really appreciate any feedback on the approach taken here
> to mitigate the security risks.

Hi, can you look at the v6 proposal for the mlx5 implementation of the
migration API and see if it meets hisilicon acc's needs as well?

https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/

There are few topics to consider:
 - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
   sense for this driver?

   I see pf_qm_state_pre_save() but didn't understand why it wanted to
   send the first 32 bytes in the PRECOPY mode? It is fine, but it
   will add some complexity to continue to do this.

 - I think we discussed the P2P implementation and decided it would
   work for this device? Can you re-read and confirm?

 - Are the arcs we defined going to work here as well? The current
   implementation in hisi_acc_vf_set_device_state() is very far away
   from what the v1 protocol is, so I'm having a hard time guessing,
   but..

      RESUMING -> STOP
        Probably vf_qm_state_resume()

      RUNNING -> STOP
        vf_qm_fun_restart() - that is oddly named..

      STOP -> RESUMING
        Seems to be a nop (likely a bug)

      STOP -> RUNNING
         Not implemented currenty? (also a bug)

      STOP -> STOP_COPY
         pf_qm_state_pre_save / vf_qm_state_save

      STOP_COPY -> STOP
         NOP

   And the modification for the P2P/NO DMA is presumably just
   fun_restart too since stopping the device and stopping DMA are
   going to be the same thing here?

The mlx5 implementation linked above is a full example you can cut and
paste from for how to implement the state function and the how to do
the data transfer. The f_ops read/write implementation for acc looks
trivial as it only streams the fixed size and pre-allocated 'struct
acc_vf_data'

It looks like it would be a short path to implement our v2 proposal
and remove a lot of driver code, as we saw in mlx5.

Thanks,
Jason
Shameerali Kolothum Thodi Feb. 2, 2022, 2:34 p.m. UTC | #2
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 02 February 2022 13:15
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> 
> On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> > This series attempts to add vfio live migration support for
> > HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> > includes both the functional register space and migration
> > control register space. As discussed in RFCv1[0], this may create
> > security issues as these regions get shared between the Guest
> > driver and the migration driver. Based on the feedback, we tried
> > to address those concerns in this version.
> >
> > This is now based on the new vfio-pci-core framework proposal[1].
> > Understand that the framework proposal is still under discussion,
> > but really appreciate any feedback on the approach taken here
> > to mitigate the security risks.
> 
> Hi, can you look at the v6 proposal for the mlx5 implementation of the
> migration API and see if it meets hisilicon acc's needs as well?
> 
> https://lore.kernel.org/all/20220130160826.32449-1-yishaih@nvidia.com/

Yes, I saw that one. Thanks for that and is now looking into it.

> 
> There are few topics to consider:
>  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
>    sense for this driver?

I think it will be STOP_COPY only for now. We might have PRECOPY feature once
we have the SMMUv3 HTTU support in future.

> 
>    I see pf_qm_state_pre_save() but didn't understand why it wanted to
>    send the first 32 bytes in the PRECOPY mode? It is fine, but it
>    will add some complexity to continue to do this.

That was mainly to do a quick verification between src and dst compatibility
before we start saving the state. I think probably we can delay that check
for later.

>  - I think we discussed the P2P implementation and decided it would
>    work for this device? Can you re-read and confirm?

In our case these devices are Integrated End Point devices and doesn't have
P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature
I guess. Also, since we cannot guarantee a NDMA state in STOP, my
assumption currently is the onus of making sure that no MMIO access happens 
in STOP is on the user. Is that a valid assumption?

>  - Are the arcs we defined going to work here as well? The current
>    implementation in hisi_acc_vf_set_device_state() is very far away
>    from what the v1 protocol is, so I'm having a hard time guessing,
>    but..

Right. The FSM has changed a couple of times since we posted this.
I am going to rebase all that now.

>       RESUMING -> STOP
>         Probably vf_qm_state_resume()
> 
>       RUNNING -> STOP
>         vf_qm_fun_restart() - that is oddly named..
> 
>       STOP -> RESUMING
>         Seems to be a nop (likely a bug)
> 
>       STOP -> RUNNING
>          Not implemented currenty? (also a bug)
> 
>       STOP -> STOP_COPY
>          pf_qm_state_pre_save / vf_qm_state_save
> 
>       STOP_COPY -> STOP
>          NOP

I will check and verify this.

>    And the modification for the P2P/NO DMA is presumably just
>    fun_restart too since stopping the device and stopping DMA are
>    going to be the same thing here?

Yes, in our case stopping device and stopping DMA are effectively the
same thing.

> 
> The mlx5 implementation linked above is a full example you can cut and
> paste from for how to implement the state function and the how to do
> the data transfer. The f_ops read/write implementation for acc looks
> trivial as it only streams the fixed size and pre-allocated 'struct
> acc_vf_data'
> 
> It looks like it would be a short path to implement our v2 proposal
> and remove a lot of driver code, as we saw in mlx5.
> 

Ok. These are the git repo I am using for the rework,
https://github.com/jgunthorpe/qemu/commits/vfio_migration_v2
https://github.com/jgunthorpe/linux/tree/vfio_migration_v2

Please let me know if the above are not up to date.

Also, just noted that my quick prototype is now failing
with below error,

" Error: VFIO device doesn't support migration"

Do we need to set the below before the feature query?
Or am I using a wrong Qemu/kernel repo?

--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
*vbasedev, uint64_t *mig_flags)
     struct vfio_device_feature_migration *mig = (void *)feature->data;

     feature->argsz = sizeof(buf);
+    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET;
     if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
         return -EOPNOTSUPP;

Thanks,
Shameer
Jason Gunthorpe Feb. 2, 2022, 3:39 p.m. UTC | #3
On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi wrote:

> > There are few topics to consider:
> >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> >    sense for this driver?
> 
> I think it will be STOP_COPY only for now. We might have PRECOPY feature once
> we have the SMMUv3 HTTU support in future.

HTTU is the dirty tracking feature? To be clear VFIO migration support
for PRECOPY has nothing to do with IOMMU based dirty page tracking.
 
> >  - I think we discussed the P2P implementation and decided it would
> >    work for this device? Can you re-read and confirm?
> 
> In our case these devices are Integrated End Point devices and doesn't have
> P2P DMA capability. Hence the FSM arcs will be limited to STOP_COPY feature
> I guess. Also, since we cannot guarantee a NDMA state in STOP, my
> assumption currently is the onus of making sure that no MMIO access happens 
> in STOP is on the user. Is that a valid assumption?

Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO
access to sustain it.

(and I'm wondering sometimes if we should rename RUNNING_P2P to
STOP_P2P - ie the device is stopped but still allows inbound P2P to
make this clearer)

> Do we need to set the below before the feature query?
> Or am I using a wrong Qemu/kernel repo?
> 
> +++ b/hw/vfio/migration.c
> @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
> *vbasedev, uint64_t *mig_flags)
>      struct vfio_device_feature_migration *mig = (void *)feature->data;
> 
>      feature->argsz = sizeof(buf);
> +    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION | VFIO_DEVICE_FEATURE_GET;
>      if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
>          return -EOPNOTSUPP;

Oh, this is my mistake I thought this got pushed to that github
already but didn't, I updated it.

If you have a prototype can you post another RFC?

Thanks,
Jason
Shameerali Kolothum Thodi Feb. 2, 2022, 4:10 p.m. UTC | #4
> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 02 February 2022 15:40
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> 
> On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > > There are few topics to consider:
> > >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> > >    sense for this driver?
> >
> > I think it will be STOP_COPY only for now. We might have PRECOPY
> > feature once we have the SMMUv3 HTTU support in future.
> 
> HTTU is the dirty tracking feature? To be clear VFIO migration support for
> PRECOPY has nothing to do with IOMMU based dirty page tracking.

Yes, it is based on the IOMMU hardware dirty bit management support.
A RFC was posted sometime back,
https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/

Ok, my guess was that the PRECOPY here was related. Thanks for clarifying.

> 
> > >  - I think we discussed the P2P implementation and decided it would
> > >    work for this device? Can you re-read and confirm?
> >
> > In our case these devices are Integrated End Point devices and doesn't
> > have P2P DMA capability. Hence the FSM arcs will be limited to
> > STOP_COPY feature I guess. Also, since we cannot guarantee a NDMA
> > state in STOP, my assumption currently is the onus of making sure that
> > no MMIO access happens in STOP is on the user. Is that a valid assumption?
> 
> Yes, you can treat RUNNING_P2P as the same as STOP and rely on no MMIO
> access to sustain it.

Ok.
 
> (and I'm wondering sometimes if we should rename RUNNING_P2P to
> STOP_P2P - ie the device is stopped but still allows inbound P2P to make this
> clearer)
> 
> > Do we need to set the below before the feature query?
> > Or am I using a wrong Qemu/kernel repo?
> >
> > +++ b/hw/vfio/migration.c
> > @@ -488,6 +488,7 @@ static int vfio_migration_query_flags(VFIODevice
> > *vbasedev, uint64_t *mig_flags)
> >      struct vfio_device_feature_migration *mig = (void
> > *)feature->data;
> >
> >      feature->argsz = sizeof(buf);
> > +    feature->flags = VFIO_DEVICE_FEATURE_MIGRATION |
> > + VFIO_DEVICE_FEATURE_GET;
> >      if (ioctl(vbasedev->fd, VFIO_DEVICE_FEATURE, feature) != 0)
> >          return -EOPNOTSUPP;
> 
> Oh, this is my mistake I thought this got pushed to that github already but
> didn't, I updated it.

Ok. Thanks.
 
> If you have a prototype can you post another RFC?

Sure, will do. I just started and has only a skeleton proto based on v2 now.
Will send out a RFC soon once I have all the FSM arcs implemented 
and sanity tested.

Thanks,
Shameer
Jason Gunthorpe Feb. 2, 2022, 5:03 p.m. UTC | #5
On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote:
> 
> 
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > Sent: 02 February 2022 15:40
> > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> > Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
> > linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
> > mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
> > <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
> > yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
> > <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
> > Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
> > 
> > On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
> > wrote:
> > 
> > > > There are few topics to consider:
> > > >  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
> > > >    sense for this driver?
> > >
> > > I think it will be STOP_COPY only for now. We might have PRECOPY
> > > feature once we have the SMMUv3 HTTU support in future.
> > 
> > HTTU is the dirty tracking feature? To be clear VFIO migration support for
> > PRECOPY has nothing to do with IOMMU based dirty page tracking.
> 
> Yes, it is based on the IOMMU hardware dirty bit management support.
> A RFC was posted sometime back,
> https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/

Yes, I saw that. I was hoping to have a discussion on this soon about
how to integrate that with the iommufd work, which I hope will allow
that series, and the other IOMMU drivers that can support this to be
merged..

Jason
Alex Williamson Feb. 2, 2022, 5:30 p.m. UTC | #6
On Wed, 2 Feb 2022 14:34:52 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> 
> > 
> >    I see pf_qm_state_pre_save() but didn't understand why it wanted to
> >    send the first 32 bytes in the PRECOPY mode? It is fine, but it
> >    will add some complexity to continue to do this.  
> 
> That was mainly to do a quick verification between src and dst compatibility
> before we start saving the state. I think probably we can delay that check
> for later.

In the v1 migration scheme, this was considered good practice.  It
shouldn't be limited to PRECOPY, as there's no requirement to use
PRECOPY, but the earlier in the migration process that we can trigger a
device or data stream compatibility fault, the better.  TBH, even in
the case where a device doesn't support live dirty tracking for a
PRECOPY phase, using it for compatibility testing continues to seem
like good practice.  Thanks,

Alex
Jason Gunthorpe Feb. 2, 2022, 6:04 p.m. UTC | #7
On Wed, Feb 02, 2022 at 10:30:41AM -0700, Alex Williamson wrote:
> On Wed, 2 Feb 2022 14:34:52 +0000
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:
> 
> > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > 
> > > 
> > >    I see pf_qm_state_pre_save() but didn't understand why it wanted to
> > >    send the first 32 bytes in the PRECOPY mode? It is fine, but it
> > >    will add some complexity to continue to do this.  
> > 
> > That was mainly to do a quick verification between src and dst compatibility
> > before we start saving the state. I think probably we can delay that check
> > for later.
> 
> In the v1 migration scheme, this was considered good practice.  It
> shouldn't be limited to PRECOPY, as there's no requirement to use
> PRECOPY, but the earlier in the migration process that we can trigger a
> device or data stream compatibility fault, the better.  TBH, even in
> the case where a device doesn't support live dirty tracking for a
> PRECOPY phase, using it for compatibility testing continues to seem
> like good practice.

At least with our thinking here, we'd rather the device expose an
explicit compatibility data via get/test system calls so we can build
proper infrastructure around this.

Every device will have compatibility requirements and we can build
more shared common code this way. ie qemu can ideally fetch the data
before migration starts and do an exchange with the live migration
target to see if it is OK. Orchestration can inventory the systems,
and automation can select live migration targets that can actually
work.

If it is hidden inside the migration stream it is too invisible to be
fully useful.

This is something we've been talking about here but don't have much
concrete to say for mlx5 yet.

The device still has to self-protect itself against a corrupted
migration stream impacting integrity, of course.

IIRC qemu has a nice spot to put this in the existing protocol.

Just overall, now that PRECOPY is optional, we should avoid using it
without a good reason. The driver implementation does have a cost.

Thanks,
Jason
Joao Martins Feb. 2, 2022, 7:05 p.m. UTC | #8
On 2/2/22 17:03, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 04:10:06PM +0000, Shameerali Kolothum Thodi wrote:
>>> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
>>> Sent: 02 February 2022 15:40
>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>>> Cc: kvm@vger.kernel.org; linux-kernel@vger.kernel.org;
>>> linux-crypto@vger.kernel.org; alex.williamson@redhat.com;
>>> mgurtovoy@nvidia.com; Linuxarm <linuxarm@huawei.com>; liulongfang
>>> <liulongfang@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>;
>>> yuzenghui <yuzenghui@huawei.com>; Jonathan Cameron
>>> <jonathan.cameron@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>
>>> Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
>>>
>>> On Wed, Feb 02, 2022 at 02:34:52PM +0000, Shameerali Kolothum Thodi
>>> wrote:
>>>
>>>>> There are few topics to consider:
>>>>>  - Which of the three feature sets (STOP_COPY, P2P and PRECOPY) make
>>>>>    sense for this driver?
>>>>
>>>> I think it will be STOP_COPY only for now. We might have PRECOPY
>>>> feature once we have the SMMUv3 HTTU support in future.
>>>
>>> HTTU is the dirty tracking feature? To be clear VFIO migration support for
>>> PRECOPY has nothing to do with IOMMU based dirty page tracking.
>>
>> Yes, it is based on the IOMMU hardware dirty bit management support.
>> A RFC was posted sometime back,
>> https://lore.kernel.org/kvm/20210507103608.39440-1-zhukeqian1@huawei.com/
> 
> Yes, I saw that. I was hoping to have a discussion on this soon about

Sorry to snatch the thread, I was thinking in doing the same this week,
as I was playing around that area, and wanted to ask the community... but
since you mentioned it :D

> how to integrate that with the iommufd work, which I hope will allow
> that series, and the other IOMMU drivers that can support this to be
> merged..

The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
there, but TBH I am not up to speed on iommu-fd yet so I missed something
obvious for sure. When you say 'integrate that with the iommufd' can you
expand on that?

Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
/internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
VMM to scan iommu pagetables itself and see what was marked dirty or not?

My over-simplistic/naive view was that the proposal in the link above sounded a lot
simpler. While iommu-fd had more longevity for many other usecases outside dirty tracking, no?

I have a PoC-ish using the interface in the link, with AMD IOMMU dirty bit supported
(including Qemu emulated amd-iommu for folks lacking the hardware). Albeit the
eager-spliting + collapsing of IOMMU hugepages is not yet done there, and I wanted to play
around the emulated intel-iommu SLADS from specs looks quite similar. Happy to join
existing effort anyways.
Jason Gunthorpe Feb. 3, 2022, 3:18 p.m. UTC | #9
On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> On 2/2/22 17:03, Jason Gunthorpe wrote:

> > how to integrate that with the iommufd work, which I hope will allow
> > that series, and the other IOMMU drivers that can support this to be
> > merged..
> 
> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> obvious for sure. When you say 'integrate that with the iommufd' can you
> expand on that?

The general idea is that iommufd is the place to put all the iommu
driver uAPI for consumption by userspace. The IOMMU feature of dirty
tracking would belong there.

So, some kind of API needs to be designed to meet the needs of the
IOMMU drivers.

> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
> VMM to scan iommu pagetables itself and see what was marked dirty or
> not?

iommufd and VFIO container's don't co-exist, either iommufd is
providing the IOMMU interface, or the current type 1 code - not both
together.

iommfd current approach presents the same ABI as the type1 container
as compatability, and it is a possible direction to provide the
iommu_domain stored dirty bits through that compat API.

But, as you say, it looks unnatural and inefficient when the domain
itself is storing the dirty bits inside the IOPTE.

It need some study I haven't got into yet :)

> My over-simplistic/naive view was that the proposal in the link
> above sounded a lot simpler. While iommu-fd had more longevity for
> many other usecases outside dirty tracking, no?

I'd prefer we don't continue to hack on the type1 code if iommufd is
expected to take over in this role - especially for a multi-vendor
feature like dirty tracking.

It is actually a pretty complicated topic because migration capable
PCI devices are also include their own dirty tracking HW, all this
needs to be harmonized somehow. VFIO proposed to squash everything
into the container code, but I've been mulling about having iommufd
only do system iommu and push the PCI device internal tracking over to
VFIO.

> I have a PoC-ish using the interface in the link, with AMD IOMMU
> dirty bit supported (including Qemu emulated amd-iommu for folks
> lacking the hardware). Albeit the eager-spliting + collapsing of
> IOMMU hugepages is not yet done there, and I wanted to play around
> the emulated intel-iommu SLADS from specs looks quite similar. Happy
> to join existing effort anyways.

This sounds great, I would love to bring the AMD IOMMU along with a
dirty tracking implementation! Can you share some patches so we can
see what the HW implementation looks like?

Jason
Joao Martins Feb. 4, 2022, 7:53 p.m. UTC | #10
On 2/3/22 15:18, Jason Gunthorpe wrote:
> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>> how to integrate that with the iommufd work, which I hope will allow
>>> that series, and the other IOMMU drivers that can support this to be
>>> merged..
>>
>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>> obvious for sure. When you say 'integrate that with the iommufd' can you
>> expand on that?
> 
> The general idea is that iommufd is the place to put all the iommu
> driver uAPI for consumption by userspace. The IOMMU feature of dirty
> tracking would belong there.
> 
> So, some kind of API needs to be designed to meet the needs of the
> IOMMU drivers.
> 
/me nods

I am gonna assume below is the most up-to-date to iommufd (as you pointed
out in another thread IIRC):

  https://github.com/jgunthorpe/linux iommufd

Let me know if it's not :)

>> Did you meant to use interface in the link, or perhaps VFIO would use an iommufd
>> /internally/ but still export the same UAPI as VFIO dirty tracking ioctls() (even if it's
>> not that efficient with a lot of bitmap copying). And optionally give a iommu_fd for the
>> VMM to scan iommu pagetables itself and see what was marked dirty or
>> not?
> 
> iommufd and VFIO container's don't co-exist, either iommufd is
> providing the IOMMU interface, or the current type 1 code - not both
> together.
> 
> iommfd current approach presents the same ABI as the type1 container
> as compatability, and it is a possible direction to provide the
> iommu_domain stored dirty bits through that compat API.
> 
> But, as you say, it looks unnatural and inefficient when the domain
> itself is storing the dirty bits inside the IOPTE.
> 
How much is this already represented as the io-pgtable in IOMMU internal kAPI
(if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

> It need some study I haven't got into yet :)
> 
Heh :)

Depending on how easy it is to obtain full extent of IO pagetables via iommu_fd
and whether userspace code can scan the dirty bits on its own ... then potentially
VMM/process can more efficiently scan the dirtied set? But if some layer needs to
somehow mediate between the vendor IOPTE representation and an UAPI IOPTE representation,
to be able to make that delegation to userspace ... then maybe both might be inefficient?
I didn't see how iommu-fd would abstract the IOPTEs lookup as far as I glanced through the
code, perhaps that's another ioctl().

>> My over-simplistic/naive view was that the proposal in the link
>> above sounded a lot simpler. While iommu-fd had more longevity for
>> many other usecases outside dirty tracking, no?
> 
> I'd prefer we don't continue to hack on the type1 code if iommufd is
> expected to take over in this role - especially for a multi-vendor
> feature like dirty tracking.
> 
But what strikes /specifically/ on the dirty bit feature is that it looks
simpler with the current VFIO, the heavy lifting seems to be
mostly on the IOMMU vendor. The proposed API above for VFIO looking at
the container (small changes), and IOMMU vendor would do most of it:

* Toggling API for start/stop dirty tracking
* API to get the IOVAs dirtied
* API to clear the IOVAs dirtied
[this last one I am not sure it is needed as the clear could be done as we scan
 the ones, thus minimize TLB flush cost if these are separate]

At the same time, what particularly scares me perf-wise (for the device being migrated)
... is the fact that we need to dynamically split and collapse page tables to
increase the granularity of which we track. In the above interface it splits/collapses
when you turn on/off the dirty tracking (respectively). That's *probably* where we
need more flexibility, not sure.

> It is actually a pretty complicated topic because migration capable
> PCI devices are also include their own dirty tracking HW, all this
> needs to be harmonized somehow. 

Do you have thoughts on what such device-dirty interface could look like?
(Perhaps too early to poke while the FSM/UAPI is being worked out)

I was wondering if container has a dirty scan/sync callback funnelled
by a vendor IOMMU ops implemented (as Shameerali patches proposed), and vfio vendor driver
provides one per device. Or propagate the dirty tracking API to vendor vfio driver[*]. The
reporting of the dirtying, though, looks hazzy to achieve if you try to make
it uniform even to userspace. Perhaps with iommu-fd you're thinking to mmap()
the dirty region back to userspace, or an iommu-fd ioctl() updates the PTEs,
while letting the kernel clear the dirty status via the mmap() object. And that
would be the common API regardless of dirty-hw scheme. Anyway, just thinking
out loud.

[*] considering the device may choose where to place its tracking storage, and
which scheme (bitmap, ring, etc) it might be.

> VFIO proposed to squash everything
> into the container code, but I've been mulling about having iommufd
> only do system iommu and push the PCI device internal tracking over to
> VFIO.
> 

Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
not sure yet how much one can we 'offload' to a generic layer, at least
compared with this other proposal.

>> I have a PoC-ish using the interface in the link, with AMD IOMMU
>> dirty bit supported (including Qemu emulated amd-iommu for folks
>> lacking the hardware). Albeit the eager-spliting + collapsing of
>> IOMMU hugepages is not yet done there, and I wanted to play around
>> the emulated intel-iommu SLADS from specs looks quite similar. Happy
>> to join existing effort anyways.
> 
> This sounds great, I would love to bring the AMD IOMMU along with a
> dirty tracking implementation! Can you share some patches so we can
> see what the HW implementation looks like?

Oh yes for sure! As I said I'm happy to help&implement along.
I would really love to leverage the IOMMU feature, as that relieves the
migrateable PCI device having to do the dirty tracking themselves.

And well, we seem to be getting there -- spec-wise everybody has that
feature *at least* documented :)

Give me some time (few days only, as I gotta sort some things) and I'll
respond here as follow up with link to a branch with the WIP/PoC patches.

Summing up a couple of remarks on hw, hopefully they enlight:

1) On AMD the feature is advertised by their extended feature register
as supported or not. On Intel, same in its equivalent (ECAP). On course
different bits on different IOMMUs registers. If it is supported, the
IOMMU updates (when activated) two bits per page table entry to indicate
'access' and 'dirty'. Slightly different page table format bit-wise on
where access/dirty is located (between the two vendors).

2) Change protection domain flags to enable the dirty/access tracking.
On AMD, it's the DTE flags (a new flag for dirty tracking).
[Intel does this in the PASID table entry]

3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
bit is cheap, clearing them is 'expensive' because one needs to flush
IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
operations to actually update the Access/Dirty bit in concurrency with
the CPU. The AMD manuals are a tad misleading as they talk about marking
non-present, but that would be catastrophic for migration as it would
mean a DMA target abort for the PCI device, unless I missed something obvious.
In any case, this means that the dirty bit *clearing* needs to be
batched as much as possible, to amortize the cost of flushing the IOTLB.
This is the same for Intel *IIUC*.

4) Adjust the granularity of pagetables in place:
[This item wasn't done, but it is generic to any IOMMU because it
is mostly the ability to split existing IO pages in place.]

4.a) Eager splitting and late collapsing IO hugepages -- essentially
to have tracking at finer granularity. as you know generally the
IOMMU map is generally done on the biggest IO page size, specially on
guests. We sadly can't write-protect or anything fancier that we
usually do so otherwise the PCI device gets a DMA target-abort :(
So splitting, when we start dirty tracking (what I mean by eager), and
do for the whole IO page table. When finished tracking, collapse the
pages (which should probably be deemed optional in the common case of
succeeding the migration). Guest expected to have higher IOTLB miss.
This part is particularly worrying for the IO guest performance, but
hopefully it doesn't turn out to be too bad.

4.b) Optionally starting dirtying earlier (at provisioning) and let
userspace dynamically split pages. This is to hopefully minimize the
IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
So dirty tracking would be enabled at creation of the protection domain
after the vfio container is set up, and we would use pages dirtied
as a indication of what needs to be splited. Problem is for IO page
sizes bigger than 1G, which might unnecessarily lead to marking too
much as dirty early on; but at least it's better than transferring the
whole set.

	Joao
Jason Gunthorpe Feb. 4, 2022, 11:07 p.m. UTC | #11
On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
> On 2/3/22 15:18, Jason Gunthorpe wrote:
> > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> >> On 2/2/22 17:03, Jason Gunthorpe wrote:
> >>> how to integrate that with the iommufd work, which I hope will allow
> >>> that series, and the other IOMMU drivers that can support this to be
> >>> merged..
> >>
> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
> >> there, but TBH I am not up to speed on iommu-fd yet so I missed something
> >> obvious for sure. When you say 'integrate that with the iommufd' can you
> >> expand on that?
> > 
> > The general idea is that iommufd is the place to put all the iommu
> > driver uAPI for consumption by userspace. The IOMMU feature of dirty
> > tracking would belong there.
> > 
> > So, some kind of API needs to be designed to meet the needs of the
> > IOMMU drivers.
> > 
> /me nods
> 
> I am gonna assume below is the most up-to-date to iommufd (as you pointed
> out in another thread IIRC):
> 
>   https://github.com/jgunthorpe/linux iommufd
> 
> Let me know if it's not :)

The iommufd part is pretty good, but there is hacky patch to hook it
into vfio that isn't there, if you want to actually try it.

> > But, as you say, it looks unnatural and inefficient when the domain
> > itself is storing the dirty bits inside the IOPTE.
> 
> How much is this already represented as the io-pgtable in IOMMU internal kAPI
> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

Which are you looking at? AFACIT there is no diry page support in
iommu_ops ?

> then potentially VMM/process can more efficiently scan the dirtied
> set? But if some layer needs to somehow mediate between the vendor
> IOPTE representation and an UAPI IOPTE representation, to be able to
> make that delegation to userspace ... then maybe both might be
> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
> lookup as far as I glanced through the code, perhaps that's another
> ioctl().

It is based around the same model as VFIO container - map/unmap of
user address space into the IOPTEs and the user space doesn't see
anything resembling a 'pte' - at least for kernel owned IO page
tables.

User space page tables will not be abstracted and the userspace must
know the direct HW format of the IOMMU they is being used.

> But what strikes /specifically/ on the dirty bit feature is that it looks
> simpler with the current VFIO, the heavy lifting seems to be
> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
> the container (small changes), and IOMMU vendor would do most of it:

It is basically the same, almost certainly the user API in iommufd
will be some 'get dirty bits' and 'unmap and give me the dirty bits'
just like vfio has.
 
The tricky details are around how do you manage this when the system
may have multiple things invovled capable, or not, of actualy doing
dirty tracking.

> At the same time, what particularly scares me perf-wise (for the
> device being migrated) ... is the fact that we need to dynamically
> split and collapse page tables to increase the granularity of which
> we track. In the above interface it splits/collapses when you turn
> on/off the dirty tracking (respectively). That's *probably* where we
> need more flexibility, not sure.

For sure that is a particularly big adventure in the iommu driver..

> Do you have thoughts on what such device-dirty interface could look like?
> (Perhaps too early to poke while the FSM/UAPI is being worked out)

I've been thinking the same general read-and-clear of a dirty
bitmap. It matches nicely the the KVM interface.

> I was wondering if container has a dirty scan/sync callback funnelled
> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 

Yes, this is almost certainly how the in-kernel parts will look

> and vfio vendor driver provides one per device. 

But this is less clear..

> Or propagate the dirty tracking API to vendor vfio driver[*]. 
> [*] considering the device may choose where to place its tracking storage, and
> which scheme (bitmap, ring, etc) it might be.

This has been my thinking, yes

> The reporting of the dirtying, though, looks hazzy to achieve if you
> try to make it uniform even to userspace. Perhaps with iommu-fd
> you're thinking to mmap() the dirty region back to userspace, or an
> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
> the dirty status via the mmap() object. And that would be the common
> API regardless of dirty-hw scheme. Anyway, just thinking out loud.

My general thinking has be that iommufd would control only the system
IOMMU hardware. The FD interface directly exposes the iommu_domain as
a manipulable object, so I'd imagine making userspace have a simple
1:1 connection to the iommu_ops of a single iommu_domain.

Doing this avoids all the weirdo questions about what do you do if
there is non-uniformity in the iommu_domain's.

Keeping with that theme the vfio_device would provide a similar
interface, on its own device FD.

I don't know if mmap should be involed here, the dirty bitmaps are not
so big, I suspect a simple get_user_pages_fast() would be entirely OK.

> > VFIO proposed to squash everything
> > into the container code, but I've been mulling about having iommufd
> > only do system iommu and push the PCI device internal tracking over to
> > VFIO.
> > 
> 
> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
> not sure yet how much one can we 'offload' to a generic layer, at least
> compared with this other proposal.

Yes, I expect there is very little generic code here if we go this
way. The generic layer is just marshalling the ioctl(s) to the iommu
drivers. Certainly not providing storage or anything/

> Give me some time (few days only, as I gotta sort some things) and I'll
> respond here as follow up with link to a branch with the WIP/PoC patches.

Great!
  
> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
> bit is cheap, clearing them is 'expensive' because one needs to flush
> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
> operations to actually update the Access/Dirty bit in concurrency with
> the CPU. The AMD manuals are a tad misleading as they talk about marking
> non-present, but that would be catastrophic for migration as it would
> mean a DMA target abort for the PCI device, unless I missed something obvious.
> In any case, this means that the dirty bit *clearing* needs to be
> batched as much as possible, to amortize the cost of flushing the IOTLB.
> This is the same for Intel *IIUC*.

You have to mark it as non-present to do the final read out if
something unmaps while the tracker is on - eg emulating a viommu or
something. Then you mark non-present, flush the iotlb and read back
the dirty bit.

Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
then read and clear them.

> 4) Adjust the granularity of pagetables in place:
> [This item wasn't done, but it is generic to any IOMMU because it
> is mostly the ability to split existing IO pages in place.]

This seems like it would be some interesting amount of driver work,
but yes it could be a generic new iommu_domina op.

> 4.b) Optionally starting dirtying earlier (at provisioning) and let
> userspace dynamically split pages. This is to hopefully minimize the
> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
> So dirty tracking would be enabled at creation of the protection domain
> after the vfio container is set up, and we would use pages dirtied
> as a indication of what needs to be splited. Problem is for IO page
> sizes bigger than 1G, which might unnecessarily lead to marking too
> much as dirty early on; but at least it's better than transferring the
> whole set.

I'm not sure running with dirty tracking permanently on would be good
for guest performance either.

I'd suspect you'd be better to have a warm up period where you track
dirtys and split down pages.

It is interesting, this is a possible reason why device dirty tracking
might actually perfom better because it can operate at a different
granularity from the system iommu without disrupting the guest DMA
performance.

Jason
Joao Martins Feb. 11, 2022, 5:28 p.m. UTC | #12
On 2/4/22 23:07, Jason Gunthorpe wrote:
> On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
>> On 2/3/22 15:18, Jason Gunthorpe wrote:
>>> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>>>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>>>> how to integrate that with the iommufd work, which I hope will allow
>>>>> that series, and the other IOMMU drivers that can support this to be
>>>>> merged..
>>>>
>>>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>>>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>>>> obvious for sure. When you say 'integrate that with the iommufd' can you
>>>> expand on that?
>>>
>>> The general idea is that iommufd is the place to put all the iommu
>>> driver uAPI for consumption by userspace. The IOMMU feature of dirty
>>> tracking would belong there.
>>>
>>> So, some kind of API needs to be designed to meet the needs of the
>>> IOMMU drivers.
>>>
>> /me nods
>>
>> I am gonna assume below is the most up-to-date to iommufd (as you pointed
>> out in another thread IIRC):
>>
>>   https://github.com/jgunthorpe/linux iommufd
>>
>> Let me know if it's not :)
> 
> The iommufd part is pretty good, but there is hacky patch to hook it
> into vfio that isn't there, if you want to actually try it.
> 
OK -- Thanks for the heads up.

>>> But, as you say, it looks unnatural and inefficient when the domain
>>> itself is storing the dirty bits inside the IOPTE.
>>
>> How much is this already represented as the io-pgtable in IOMMU internal kAPI
>> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
>> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(
> 
> Which are you looking at? AFACIT there is no diry page support in
> iommu_ops ?
> 
There is not, indeed. But how you manage IO page tables felt *somewhat
hinting* to io-pgtable -- but maybe it's a stretch. The dirty
page support builds on top of that, as ARM IOMMUs use iommu io-pgtable.

In the series in the link, it pretty much boils down to
essentially 3 iommu_ops introduced:

* support_dirty_log(domain)
  - Checks whether a given IOMMU domain supports dirty tracking
* switch_dirty_log(domain, enable, iova, size, prot)
  - Enables/Disables IOMMU dirty tracking on an IOVA range
  (The range I suppose it's because ARM supports range tracking, contrary
   to Intel and AMD which affect all the protection domain pg tables)
* sync_dirty_log(domain, iova, size, &bitmap, ...)

The IO pgtables, have an interesting set of APIs strictly for page table
manipulation, and they are sort of optional (so far ARM and AMD). Those
include, map, unmap, iova_to_pfn translation. So, the dirty log supports
for ARM builds on that and adds split, merge (for splitting and collapsing
IO pagtables), and finally sync_dirty_log() which is where we get the
bitmap of dirtied pages *updated*.

But well, at the end of the day for an IOMMU driver the domain ops are
the important stuff, maybe IO pgtable framework isn't as critical
(Intel for example, doesn't use that at all).

>> then potentially VMM/process can more efficiently scan the dirtied
>> set? But if some layer needs to somehow mediate between the vendor
>> IOPTE representation and an UAPI IOPTE representation, to be able to
>> make that delegation to userspace ... then maybe both might be
>> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
>> lookup as far as I glanced through the code, perhaps that's another
>> ioctl().
> 
> It is based around the same model as VFIO container - map/unmap of
> user address space into the IOPTEs and the user space doesn't see
> anything resembling a 'pte' - at least for kernel owned IO page
> tables.
> 
Ah! I misinterpreted what you were saying.

> User space page tables will not be abstracted and the userspace must
> know the direct HW format of the IOMMU they is being used.
> 
That's countering earlier sentence? Because hw format (for me at least)
means PTE and protection domain config format too. And if iommufd
abstracts the HW format, modelling after the IOMMU domain and its ops,
then it's abstracting userspace from those details e.g. it works over
IOVAs, but not over its vendor representation of how that IOVA is set up.

I am probably being dense.

>> But what strikes /specifically/ on the dirty bit feature is that it looks
>> simpler with the current VFIO, the heavy lifting seems to be
>> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
>> the container (small changes), and IOMMU vendor would do most of it:
> 
> It is basically the same, almost certainly the user API in iommufd
> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> just like vfio has.
> 

The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
migration flow. Considering that we pause vCPUs, and we quiesce the
VFs, I guess we are mostly supposed to have a guarantee that no DMA is
supposed to be happening (excluding P2P)? So perhaps the unmap is
unneeded after quiescing the VF. Which probably is important
migration-wise considering the unmap is really what's expensive (e.g.
needs IOTLB flush) and that would add up between stop copy and the time it
would resume on the destination, if we need to wait to unmap the whole IOVA
from source.

> The tricky details are around how do you manage this when the system
> may have multiple things invovled capable, or not, of actualy doing
> dirty tracking.
> 
Yeap.

>> At the same time, what particularly scares me perf-wise (for the
>> device being migrated) ... is the fact that we need to dynamically
>> split and collapse page tables to increase the granularity of which
>> we track. In the above interface it splits/collapses when you turn
>> on/off the dirty tracking (respectively). That's *probably* where we
>> need more flexibility, not sure.
> 
> For sure that is a particularly big adventure in the iommu driver..
> 
Yeah.

Me playing around was with VFIO IOMMU hugepages disabled, and this
was something I am planning on working when I get back to this.

>> Do you have thoughts on what such device-dirty interface could look like?
>> (Perhaps too early to poke while the FSM/UAPI is being worked out)
> 
> I've been thinking the same general read-and-clear of a dirty
> bitmap. It matches nicely the the KVM interface.
> 

A bitmap I think it is a good start.

We have a bitmap based interface in KVM, but there's also a recent ring
interface for dirty tracking, which has some probably more determinism than
a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
and breaking its entries on-demand IIRC, whereas Intel resembles something
closer to a 512 entries 'ring' with VMX PML, which tells what has been
dirtied.

>> I was wondering if container has a dirty scan/sync callback funnelled
>> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 
> 
> Yes, this is almost certainly how the in-kernel parts will look
> 
/me nods

>> and vfio vendor driver provides one per device. 
> 
> But this is less clear..
> 
>> Or propagate the dirty tracking API to vendor vfio driver[*]. 
>> [*] considering the device may choose where to place its tracking storage, and
>> which scheme (bitmap, ring, etc) it might be.
> 
> This has been my thinking, yes
> 
/me nods

>> The reporting of the dirtying, though, looks hazzy to achieve if you
>> try to make it uniform even to userspace. Perhaps with iommu-fd
>> you're thinking to mmap() the dirty region back to userspace, or an
>> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
>> the dirty status via the mmap() object. And that would be the common
>> API regardless of dirty-hw scheme. Anyway, just thinking out loud.
> 
> My general thinking has be that iommufd would control only the system
> IOMMU hardware. The FD interface directly exposes the iommu_domain as
> a manipulable object, so I'd imagine making userspace have a simple
> 1:1 connection to the iommu_ops of a single iommu_domain.
> 
> Doing this avoids all the weirdo questions about what do you do if
> there is non-uniformity in the iommu_domain's.
> 
> Keeping with that theme the vfio_device would provide a similar
> interface, on its own device FD.
> 
> I don't know if mmap should be involed here, the dirty bitmaps are not
> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> 
Considering that is 32MB of a bitmap per TB maybe it is cheap.

>>> VFIO proposed to squash everything
>>> into the container code, but I've been mulling about having iommufd
>>> only do system iommu and push the PCI device internal tracking over to
>>> VFIO.
>>>
>>
>> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
>> not sure yet how much one can we 'offload' to a generic layer, at least
>> compared with this other proposal.
> 
> Yes, I expect there is very little generic code here if we go this
> way. The generic layer is just marshalling the ioctl(s) to the iommu
> drivers. Certainly not providing storage or anything/
> 
Gotcha. Perhaps I need to sort out how this would work with iommufd.

But I guess we can hash out how the iommu ops would look like?

We seem to be on the same page, at least that's the feeling I get.

>> Give me some time (few days only, as I gotta sort some things) and I'll
>> respond here as follow up with link to a branch with the WIP/PoC patches.
> 
> Great!
>
Here it is. "A few days" turn into a week sorry :/

https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip

Note, it is an early PoC. I still need to get the split/collapse thing going,
and fix the FIXMEs there, and have a second good look at the iommu page tables.

The Qemu link above has one patch to get device-dirty bitmaps via HMP to be able to
measure the rate by hw dirties and another patch for emulated amd iommu support too,
should you lack the hardware to play (I usually do on both, more for people to be
able to coverage test it). I sadly cannot do the end-to-end migration test.

>> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
>> bit is cheap, clearing them is 'expensive' because one needs to flush
>> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
>> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
>> operations to actually update the Access/Dirty bit in concurrency with
>> the CPU. The AMD manuals are a tad misleading as they talk about marking
>> non-present, but that would be catastrophic for migration as it would
>> mean a DMA target abort for the PCI device, unless I missed something obvious.
>> In any case, this means that the dirty bit *clearing* needs to be
>> batched as much as possible, to amortize the cost of flushing the IOTLB.
>> This is the same for Intel *IIUC*.
> 
> You have to mark it as non-present to do the final read out if
> something unmaps while the tracker is on - eg emulating a viommu or
> something. Then you mark non-present, flush the iotlb and read back
> the dirty bit.
> 
You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
too :) without needing VMM intervention (that's also not supported
in Linux).

But the mark as NP for viommu is something I haven't investigated.

> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> then read and clear them.
> 
It's the other way around AIUI. The dirty bits are sticky, so you flush
the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
again on the next memory transaction (or ATS translation).

I am not entirely sure we need to unmap + mark non-present for non-viommu
That would actually mean something is not properly quiscieing the VF DMA.
Maybe we should .. to gate whether if we should actually continue with LM
if something kept doing DMA when it shouldn't have.

Also, considering the dirty-bitmap check happens on very-very long IOVA extents --
some measurements over how long the scans need to be done. Mostly
validated how much a much NIC was dirtying with traditional networking tools.

>> 4) Adjust the granularity of pagetables in place:
>> [This item wasn't done, but it is generic to any IOMMU because it
>> is mostly the ability to split existing IO pages in place.]
> 
> This seems like it would be some interesting amount of driver work,
> but yes it could be a generic new iommu_domina op.
> 
I am slightly at odds that .split and .collapse at .switch() are enough.
But, with iommu if we are working on top of an IOMMU domain object and
.split and .collapse are iommu_ops perhaps that looks to be enough
flexibility to give userspace the ability to decide what it wants to
split, if it starts eargerly/warming-up tracking dirty pages.

The split and collapsing is something I wanted to work on next, to get
to a stage closer to that of an RFC on the AMD side.

>> 4.b) Optionally starting dirtying earlier (at provisioning) and let
>> userspace dynamically split pages. This is to hopefully minimize the
>> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
>> So dirty tracking would be enabled at creation of the protection domain
>> after the vfio container is set up, and we would use pages dirtied
>> as a indication of what needs to be splited. Problem is for IO page
>> sizes bigger than 1G, which might unnecessarily lead to marking too
>> much as dirty early on; but at least it's better than transferring the
>> whole set.
> 
> I'm not sure running with dirty tracking permanently on would be good
> for guest performance either.
> 
Hmmm, judging how the IOMMU works I am not sure this is particularly
affecting DMA performance (not sure yet about RDMA, it's something I
curious to see how it gets to perform with 4K IOPTEs, and with dirty
tracking always enabled). Considering how the bits are sticky, and
unless CPU clears it, it's short of a nop? Unless of course the checking
for A^D during an atomic memory transaction is expensive. Needs some
performance testing nonetheless.

The IOTLB flushes pre/after clearing, tough, would visibly hurt DMA
performance :(

I forgot to mention, but the early enablement of IOMMU dirty tracking
was also meant to fully know since guest creation what needs to be
sent to the destination. Otherwise, wouldn't we need to send the whole
pinned set to destination, if we only start tracking dirty pages during
migration?

> I'd suspect you'd be better to have a warm up period where you track
> dirtys and split down pages.
> 
Yeah, but right now the splitting is done when switching on and off
dirty tracking, for the entirety of page tables.

And well at least on AMD, the flag is attached to a protection domain.
For SMMUv3. it appears that it's per IOVA range if I read the original code
correctly (which is interesting contrast to AMD/Intel)

Also, this is probably a differentiator for iommufd, if we were to provide
split and collapse semantics to IOMMU domain objects that userspace can use.
That would get more freedom, to switch dirty-tracking, and then do the warm
up thingie and piggy back on what it wants to split before migration.
perhaps the switch() should get some flag to pick where to split, I guess.


> It is interesting, this is a possible reason why device dirty tracking
> might actually perfom better because it can operate at a different
> granularity from the system iommu without disrupting the guest DMA
> performance.

I agree -- and the other advantage is that you don't depend on platform
support for dirty tracking, which is largely nonexistent.

Albeit I personally like to have the IOMMU support because it also
frees endpoints to do rather expensive dirty tracking, which is a rather
even more exclusive feature to have. It's trading off max performance for
LM commodity over, where DMA performance can be less (momentarily).

	Joao
Jason Gunthorpe Feb. 11, 2022, 5:49 p.m. UTC | #13
On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:

> But well, at the end of the day for an IOMMU driver the domain ops are
> the important stuff, maybe IO pgtable framework isn't as critical
> (Intel for example, doesn't use that at all).

Right, it doesn't matter what library was used to implement the
domain..

> > User space page tables will not be abstracted and the userspace must
> > know the direct HW format of the IOMMU they is being used.
> > 
> That's countering earlier sentence? Because hw format (for me at least)
> means PTE and protection domain config format too. And if iommufd
> abstracts the HW format, modelling after the IOMMU domain and its ops,
> then it's abstracting userspace from those details e.g. it works over
> IOVAs, but not over its vendor representation of how that IOVA is set up.
> 
> I am probably being dense.

It is both ways, one kind of domain provides a kernel supplied
map/unmap that implements the IO PTE manipulation in kernel memory

But if you want the IOPTE to be in user memory then user must
read/write it and it cannot use that - so a user domain will not have
map/unmap.

> > It is basically the same, almost certainly the user API in iommufd
> > will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> > just like vfio has.
> > 
> 
> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
> migration flow. 

It is essential to implement any kind of viommu behavior where
map/unmap is occuring while the dirty tracking is running. It should
never make a difference except in some ugly edge cases where the dma
and unmap are racing.

> supposed to be happening (excluding P2P)? So perhaps the unmap is
> unneeded after quiescing the VF.

Yes, you don't need to unmap for migration, a simple fully synchronous
read and clear is sufficient. But that final read, while DMA is quite,
must obtain the latest dirty bit data.
 
> We have a bitmap based interface in KVM, but there's also a recent ring
> interface for dirty tracking, which has some probably more determinism than
> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
> and breaking its entries on-demand IIRC, whereas Intel resembles something
> closer to a 512 entries 'ring' with VMX PML, which tells what has been
> dirtied.

KVM has an advantage that it can throttle the rate of dirty generation
so it can rely on logging. The IOMMU can't throttle DMA, so we are
stuck with a bitmap

> > I don't know if mmap should be involed here, the dirty bitmaps are not
> > so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> > 
> Considering that is 32MB of a bitmap per TB maybe it is cheap.

Rigt. GUP fasting a couple huge pages is nothing compared to scanning
1TB of IO page table.

> >> Give me some time (few days only, as I gotta sort some things) and I'll
> >> respond here as follow up with link to a branch with the WIP/PoC patches.
> > 
> > Great!
> >
> Here it is. "A few days" turn into a week sorry :/
> 
> https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
> https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip
> 
> Note, it is an early PoC. I still need to get the split/collapse thing going,
> and fix the FIXMEs there, and have a second good look at the iommu page tables.

Oh I'll try to look a this thanks

> > You have to mark it as non-present to do the final read out if
> > something unmaps while the tracker is on - eg emulating a viommu or
> > something. Then you mark non-present, flush the iotlb and read back
> > the dirty bit.
> > 
> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
> too :) without needing VMM intervention (that's also not supported
> in Linux).

I'm sure, but dirty tracking has to happen on the kernel owned page
table, not the user owned one I think..

> > Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> > then read and clear them.
> > 
> It's the other way around AIUI. The dirty bits are sticky, so you flush
> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
> again on the next memory transaction (or ATS translation).

I guess it depends on how the HW works, if it writes the dirty bit
back to ram instantly on the first dirty, or if it only writes the
dirty bit when flushing the iotlb.

In any case you have to synchronize with the HW in some way to ensure
that all dirty bits are 'pulled back' into system memory to resolve
races (ie you need the iommu HW to release and the CPU to acquire on
the dirty data) before trying to read them, at least for the final
quieced system read.

> I am not entirely sure we need to unmap + mark non-present for non-viommu
> That would actually mean something is not properly quiscieing the VF DMA.
> Maybe we should .. to gate whether if we should actually continue with LM
> if something kept doing DMA when it shouldn't have.

It is certainly an edge case. A device would be misbehaving to
continue DMA.

> > This seems like it would be some interesting amount of driver work,
> > but yes it could be a generic new iommu_domina op.
> 
> I am slightly at odds that .split and .collapse at .switch() are enough.
> But, with iommu if we are working on top of an IOMMU domain object and
> .split and .collapse are iommu_ops perhaps that looks to be enough
> flexibility to give userspace the ability to decide what it wants to
> split, if it starts eargerly/warming-up tracking dirty pages.
> 
> The split and collapsing is something I wanted to work on next, to get
> to a stage closer to that of an RFC on the AMD side.

split/collapse seems kind of orthogonal to me it doesn't really
connect to dirty tracking other than being mostly useful during dirty
tracking.

And I wonder how hard split is when trying to atomically preserve any
dirty bit..

> Hmmm, judging how the IOMMU works I am not sure this is particularly
> affecting DMA performance (not sure yet about RDMA, it's something I
> curious to see how it gets to perform with 4K IOPTEs, and with dirty
> tracking always enabled). Considering how the bits are sticky, and
> unless CPU clears it, it's short of a nop? Unless of course the checking
> for A^D during an atomic memory transaction is expensive. Needs some
> performance testing nonetheless.

If you leave the bits all dirty then why do it? The point is to see
the dirties, which means the iommu is generating a workload of dirty
cachelines while operating it didn't do before.

> I forgot to mention, but the early enablement of IOMMU dirty tracking
> was also meant to fully know since guest creation what needs to be
> sent to the destination. Otherwise, wouldn't we need to send the whole
> pinned set to destination, if we only start tracking dirty pages during
> migration?

? At the start of migration you have to send everything. Dirty
tracking is intended to allow the VM to be stopped and then have only
a small set of data to xfer.
 
> Also, this is probably a differentiator for iommufd, if we were to provide
> split and collapse semantics to IOMMU domain objects that userspace can use.
> That would get more freedom, to switch dirty-tracking, and then do the warm
> up thingie and piggy back on what it wants to split before migration.
> perhaps the switch() should get some flag to pick where to split, I guess.

Yes, right. Split/collapse should be completely seperate from dirty
tracking.

Jason
Joao Martins Feb. 11, 2022, 9:43 p.m. UTC | #14
On 2/11/22 17:49, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:
>>> It is basically the same, almost certainly the user API in iommufd
>>> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
>>> just like vfio has.
>>>
>>
>> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
>> migration flow. 
> 
> It is essential to implement any kind of viommu behavior where
> map/unmap is occuring while the dirty tracking is running. It should
> never make a difference except in some ugly edge cases where the dma
> and unmap are racing.
> 
/me nods

>> supposed to be happening (excluding P2P)? So perhaps the unmap is
>> unneeded after quiescing the VF.
> 
> Yes, you don't need to unmap for migration, a simple fully synchronous
> read and clear is sufficient. But that final read, while DMA is quite,
> must obtain the latest dirty bit data.
>

The final read doesn't need special logic in VFIO/IOMMU IUC
(maybe only in the VMM)

Anyways, the above paragraph matches my understanding.

>> We have a bitmap based interface in KVM, but there's also a recent ring
>> interface for dirty tracking, which has some probably more determinism than
>> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
>> and breaking its entries on-demand IIRC, whereas Intel resembles something
>> closer to a 512 entries 'ring' with VMX PML, which tells what has been
>> dirtied.
> 
> KVM has an advantage that it can throttle the rate of dirty generation
> so it can rely on logging. The IOMMU can't throttle DMA, so we are
> stuck with a bitmap
> 
Yeap, sadly :(

>>> I don't know if mmap should be involed here, the dirty bitmaps are not
>>> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
>>>
>> Considering that is 32MB of a bitmap per TB maybe it is cheap.
> 
> Rigt. GUP fasting a couple huge pages is nothing compared to scanning
> 1TB of IO page table.
> 
... With 4K PTEs which is even more ludricous expensive. Well yeah, a
lesser concern the mangling of the bitmap, when you put that way heh

>>> You have to mark it as non-present to do the final read out if
>>> something unmaps while the tracker is on - eg emulating a viommu or
>>> something. Then you mark non-present, flush the iotlb and read back
>>> the dirty bit.
>>>
>> You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
>> too :) without needing VMM intervention (that's also not supported
>> in Linux).
> 
> I'm sure, but dirty tracking has to happen on the kernel owned page
> table, not the user owned one I think..
> 
The plumbing for the hw-accelerated vIOMMU is a little different that
a regular vIOMMU, at least IIUC host does not take an active part in the
GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
have time to fiddle with the SDM:

https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf


>>> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
>>> then read and clear them.
>>>
>> It's the other way around AIUI. The dirty bits are sticky, so you flush
>> the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
>> again on the next memory transaction (or ATS translation).
> 
> I guess it depends on how the HW works, if it writes the dirty bit
> back to ram instantly on the first dirty, or if it only writes the
> dirty bit when flushing the iotlb.
> 
The manual says roughly that the update is visible to CPU as soon as the
it updates. Particularly from the IOMMU SDM:

"Note that the setting of accessed and dirty status bits in the page tables is visible to
both the CPU and the peripheral when sharing guest page tables. The IOMMU interlocked
operations to update A and D bits must be 64-bit operations and naturally aligned on a
64-bit boundary"

And ...

1. Decodes the read and write intent from the memory access.
2. If P=0 in the page descriptor, fail the access.
3. Compare the A & D bits in the descriptor with the read and write intent in the request.
4. If the A or D bits need to be updated in the descriptor:
* Start atomic operation.
* Read the descriptor as a 64-bit access.
* If the descriptor no longer appears to require an update, release the atomic lock with
no further action and continue to step 5.
* Calculate the new A & D bits.
* Write the descriptor as a 64-bit access.
* End atomic operation.
5. Continue to the next stage of translation or to the memory access.

> In any case you have to synchronize with the HW in some way to ensure
> that all dirty bits are 'pulled back' into system memory to resolve
> races (ie you need the iommu HW to release and the CPU to acquire on
> the dirty data) before trying to read them, at least for the final
> quieced system read.
> 
/me nods

>>> This seems like it would be some interesting amount of driver work,
>>> but yes it could be a generic new iommu_domina op.
>>
>> I am slightly at odds that .split and .collapse at .switch() are enough.
>> But, with iommu if we are working on top of an IOMMU domain object and
>> .split and .collapse are iommu_ops perhaps that looks to be enough
>> flexibility to give userspace the ability to decide what it wants to
>> split, if it starts eargerly/warming-up tracking dirty pages.
>>
>> The split and collapsing is something I wanted to work on next, to get
>> to a stage closer to that of an RFC on the AMD side.
> 
> split/collapse seems kind of orthogonal to me it doesn't really
> connect to dirty tracking other than being mostly useful during dirty
> tracking.
> 
> And I wonder how hard split is when trying to atomically preserve any
> dirty bit..
> 
Would would it be hard? The D bit is supposed to be replicated when you
split to smaller page size.

>> Hmmm, judging how the IOMMU works I am not sure this is particularly
>> affecting DMA performance (not sure yet about RDMA, it's something I
>> curious to see how it gets to perform with 4K IOPTEs, and with dirty
>> tracking always enabled). Considering how the bits are sticky, and
>> unless CPU clears it, it's short of a nop? Unless of course the checking
>> for A^D during an atomic memory transaction is expensive. Needs some
>> performance testing nonetheless.
> 
> If you leave the bits all dirty then why do it? The point is to see
> the dirties, which means the iommu is generating a workload of dirty
> cachelines while operating it didn't do before.
> 
My thinking was that if it's dirtied and in the IOTLB most likely the
descriptor in the IOTLB is cached. And if you need to do a IOMMU page walk
to resolve an IOVA, perhaps the check for the A & D bits needing
to be updated is probably the least problem in this path. Naturally, if
it's not split, you have a much higher chance (e.g. with 1GB entries) to stay
in the IOTLB and just compare two bits *prior* to consider starting
the atomic operation to update the descriptor.

>> I forgot to mention, but the early enablement of IOMMU dirty tracking
>> was also meant to fully know since guest creation what needs to be
>> sent to the destination. Otherwise, wouldn't we need to send the whole
>> pinned set to destination, if we only start tracking dirty pages during
>> migration?
> 
> ? At the start of migration you have to send everything. Dirty
> tracking is intended to allow the VM to be stopped and then have only
> a small set of data to xfer.
>  
Right, that's how it works today.

This is just preemptive longterm thinking about the overal problem space (probably
unnecessary noise at this stage). Particularly whenever I need to migrate 1 to 2TB VMs.
Particular that the stage *prior* to precopy takes way too long to transfer the whole
memory. So I was thinking say only transfer the pages that are populated[*] in the
second-stage page tables (for the CPU) coupled with IOMMU tracking from the beginning
(prior to vcpus even entering). That could probably decrease 1024 1GB Dirtied IOVA
entries, to maybe only dirty a smaller subset, saving a whole bootload of time.

[*] VMs without VFIO would be even easier as the first stage page tables are non-present.

>> Also, this is probably a differentiator for iommufd, if we were to provide
>> split and collapse semantics to IOMMU domain objects that userspace can use.
>> That would get more freedom, to switch dirty-tracking, and then do the warm
>> up thingie and piggy back on what it wants to split before migration.
>> perhaps the switch() should get some flag to pick where to split, I guess.
> 
> Yes, right. Split/collapse should be completely seperate from dirty
> tracking.

Yeap.

I wonder if we could start progressing the dirty tracking as a first initial series and
then have the split + collapse handling as a second part? That would be quite
nice to get me going! :D

	Joao
Jason Gunthorpe Feb. 12, 2022, 12:01 a.m. UTC | #15
On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
> The plumbing for the hw-accelerated vIOMMU is a little different that
> a regular vIOMMU, at least IIUC host does not take an active part in the
> GVA -> GPA translation. Suravee's preso explains it nicely, if you don't
> have time to fiddle with the SDM:
> 
> https://static.sched.com/hosted_files/kvmforum2021/da/vIOMMU%20KVM%20Forum%202021%20-%20v4.pdf

That looks like the same nesting everyone else is talking about.

I've been refering to this as 'user space page tables' to avoid a lot
of ambiguity becuase what it fundamentally does is allow userspace to
directly manage an IO page table - building the IO PTEs, issue
invalidations, etc.

To be secure the userspace page table is nested under a kernel owned
page table and all the IO PTE offsets/etc are understood to be within
the address space of the kernel owned page table.

When combined with KVM the userspace is just inside a VM.

> 1. Decodes the read and write intent from the memory access.
> 2. If P=0 in the page descriptor, fail the access.
> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
> 4. If the A or D bits need to be updated in the descriptor:

Ah, so the dirty update is actually atomic on the first write before
any DMA happens - and I suppose all of this happens when the entry is
first loaded into the IOTLB.

So the flush is to allow the IOTLB to see the cleared D bit..

> > split/collapse seems kind of orthogonal to me it doesn't really
> > connect to dirty tracking other than being mostly useful during dirty
> > tracking.
> > 
> > And I wonder how hard split is when trying to atomically preserve any
> > dirty bit..
> > 
> Would would it be hard? The D bit is supposed to be replicated when you
> split to smaller page size.

I guess it depends on how the 'acquire' is done, as the CPU has to
atomically replace a large entry with a pointer to a small entry,
flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
in the old entry then it has to sprinkle it into the new entries with
atomics.

> This is just preemptive longterm thinking about the overal problem
> space (probably unnecessary noise at this stage). Particularly
> whenever I need to migrate 1 to 2TB VMs.  Particular that the stage
> *prior* to precopy takes way too long to transfer the whole
> memory. So I was thinking say only transfer the pages that are
> populated[*] in the second-stage page tables (for the CPU) coupled
> with IOMMU tracking from the beginning (prior to vcpus even
> entering). That could probably decrease 1024 1GB Dirtied IOVA
> entries, to maybe only dirty a smaller subset, saving a whole
> bootload of time.

Oh, you want to effectively optimize zero page detection..

> I wonder if we could start progressing the dirty tracking as a first initial series and
> then have the split + collapse handling as a second part? That would be quite
> nice to get me going! :D

I think so, and I think we should. It is such a big problem space, it
needs to get broken up.

Jason
Joao Martins Feb. 14, 2022, 1:34 p.m. UTC | #16
On 2/12/22 00:01, Jason Gunthorpe wrote:
> On Fri, Feb 11, 2022 at 09:43:56PM +0000, Joao Martins wrote:
>> 1. Decodes the read and write intent from the memory access.
>> 2. If P=0 in the page descriptor, fail the access.
>> 3. Compare the A & D bits in the descriptor with the read and write intent in the request.
>> 4. If the A or D bits need to be updated in the descriptor:
> 
> Ah, so the dirty update is actually atomic on the first write before
> any DMA happens - and I suppose all of this happens when the entry is
> first loaded into the IOTLB.
> 
Intel's equivalent feature suggests me that this works the same way there.

ARM update of ioptes looks to be slightly different[*] but this FEAT_BBM
in SMMUv3.2 makes it work in similar way to Intel/AMD. But I could be
misunderstanding something there.

[*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
and then write the new valid entry. Not sure I understood correctly that this
is the 'break-before-make' thingie.

> So the flush is to allow the IOTLB to see the cleared D bit..
> 
Right.

>>> split/collapse seems kind of orthogonal to me it doesn't really
>>> connect to dirty tracking other than being mostly useful during dirty
>>> tracking.
>>>
>>> And I wonder how hard split is when trying to atomically preserve any
>>> dirty bit..
>>>
>> Would would it be hard? The D bit is supposed to be replicated when you
>> split to smaller page size.
> 
> I guess it depends on how the 'acquire' is done, as the CPU has to
> atomically replace a large entry with a pointer to a small entry,
> flush the IOTLB then 'acquire' the dirty bit. If the dirty bit is set
> in the old entry then it has to sprinkle it into the new entries with
> atomics.
> 
ISTR some mention to what we are chatting here in the IOMMU SDM:

When a non-default page size is used , software must OR the Dirty bits in
all of the replicated host PTEs used to map the page. The IOMMU does not
guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
of the page may have been written even if the Dirty bit is set in only one
of the replicated PTEs.

>> I wonder if we could start progressing the dirty tracking as a first initial series and
>> then have the split + collapse handling as a second part? That would be quite
>> nice to get me going! :D
> 
> I think so, and I think we should. It is such a big problem space, it
> needs to get broken up.

OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
past except with the x86 support only[*]. And we poke holes there I guess.

[*] I might include Intel too, albeit emulated only.
Jason Gunthorpe Feb. 14, 2022, 2:06 p.m. UTC | #17
On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:

> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> and then write the new valid entry. Not sure I understood correctly that this
> is the 'break-before-make' thingie.

Doesn't that explode if the invalid entry is DMA'd to?

> When a non-default page size is used , software must OR the Dirty bits in
> all of the replicated host PTEs used to map the page. The IOMMU does not
> guarantee the Dirty bits are set in all of the replicated PTEs. Any portion
> of the page may have been written even if the Dirty bit is set in only one
> of the replicated PTEs.

I suspect that is talking about something else

> >> I wonder if we could start progressing the dirty tracking as a first initial series and
> >> then have the split + collapse handling as a second part? That would be quite
> >> nice to get me going! :D
> > 
> > I think so, and I think we should. It is such a big problem space, it
> > needs to get broken up.
> 
> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
> past except with the x86 support only[*]. And we poke holes there I guess.
> 
> [*] I might include Intel too, albeit emulated only.

Like I said, I'd prefer we not build more on the VFIO type 1 code
until we have a conclusion for iommufd..

While returning the dirty data looks straight forward, it is hard to
see an obvious path to enabling and controlling the system iommu the
way vfio is now.

Jason
Joao Martins Feb. 15, 2022, 4 p.m. UTC | #18
On 2/14/22 14:06, Jason Gunthorpe wrote:
> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
> 
>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>> and then write the new valid entry. Not sure I understood correctly that this
>> is the 'break-before-make' thingie.
> 
> Doesn't that explode if the invalid entry is DMA'd to?
> 
Yes, IIUC. Also, the manual has this note:

"Note: For example, to split a block into constituent granules
(or to merge a span of granules into an equivalent block), VMSA
requires the region to be made invalid, a TLB invalidate
performed, then to make the region take the new configuration.

Note: The requirement for a break-before-make sequence can cause
problems for unrelated I/O streams that might use addresses
overlapping a region of interest, because the I/O streams cannot
always be conveniently stopped and might not tolerate translation
faults. It is advantageous to perform live update of a block into
smaller translations, or a set of translations into a larger block
size."

Probably why the original SMMUv3.2 dirty track series requires FEAT_BBM
as it had to do in-place atomic updates to split/collapse IO pgtables.
Not enterily clear if HTTU Dirty access requires the same.

>>>> I wonder if we could start progressing the dirty tracking as a first initial series and
>>>> then have the split + collapse handling as a second part? That would be quite
>>>> nice to get me going! :D
>>>
>>> I think so, and I think we should. It is such a big problem space, it
>>> needs to get broken up.
>>
>> OK, cool! I'll stick with the same (slimmed down) IOMMU+VFIO interface as proposed in the
>> past except with the x86 support only[*]. And we poke holes there I guess.
>>
>> [*] I might include Intel too, albeit emulated only.
> 
> Like I said, I'd prefer we not build more on the VFIO type 1 code
> until we have a conclusion for iommufd..
> 

I didn't quite understand what you mean by conclusion.

If by conclusion you mean the whole thing to be merged, how can the work be
broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
in terms of direction...

I can build on top of iommufd -- Just trying to understand how this is
going to work out.

> While returning the dirty data looks straight forward, it is hard to
> see an obvious path to enabling and controlling the system iommu the
> way vfio is now.

It seems strange to have a whole UAPI for userspace [*] meant to
return dirty data to userspace, when dirty right now means the whole
pinned page set and so copying the whole guest ... and the guest is
running so we might be racing with the device changing guest pages with the
VMM/CPU unaware of it. Even with no dwelling of IOMMU pagetables (i.e. split/collapse
IO base pages) it would still help greatly the current status quo of copying
the entire thing :(

Hence my thinking was that the patches /if small/ would let us see how dirty
tracking might work for iommu kAPI (and iommufd) too.

Would it be better to do more iterative steps (when possible) as opposed to
scratch and rebuild VFIO type1 IOMMU handling?

	Joao

[*] VFIO_IOMMU_DIRTY_PAGES{_FLAG_START,_FLAG_STOP,_FLAG_GET_BITMAP}
Jason Gunthorpe Feb. 15, 2022, 4:21 p.m. UTC | #19
On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
> On 2/14/22 14:06, Jason Gunthorpe wrote:
> > On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
> > 
> >> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
> >> and then write the new valid entry. Not sure I understood correctly that this
> >> is the 'break-before-make' thingie.
> > 
> > Doesn't that explode if the invalid entry is DMA'd to?
> > 
> Yes, IIUC. Also, the manual has this note:

Heh, sounds like "this doesn't work" to me :)

> > Like I said, I'd prefer we not build more on the VFIO type 1 code
> > until we have a conclusion for iommufd..
> > 
> 
> I didn't quite understand what you mean by conclusion.

If people are dead-set against doing iommufd, then lets abandon the
idea and go back to hacking up vfio.
 
> If by conclusion you mean the whole thing to be merged, how can the work be
> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> in terms of direction...

I think go ahead and build it on top of iommufd, start working out the
API details, etc. I think once the direction is concluded the new APIs
will go forward.

> > While returning the dirty data looks straight forward, it is hard to
> > see an obvious path to enabling and controlling the system iommu the
> > way vfio is now.
> 
> It seems strange to have a whole UAPI for userspace [*] meant to
> return dirty data to userspace, when dirty right now means the whole
> pinned page set and so copying the whole guest ... 

Yes, the whole thing is only partially implemented, and doesn't have
any in-kernel user. It is another place holder for an implementation
to come someday.

> Hence my thinking was that the patches /if small/ would let us see how dirty
> tracking might work for iommu kAPI (and iommufd) too.

It could be tried, but I think if you go into there you will find it
quickly turns quite complicated to address all the edge cases. Eg what
do you do if you have a mdev present after you turn on system
tracking? What if the mdev is using a PASID? What about hotplug of new
VFIO devices?

Remember, dirty tracking for vfio is totally useless without also
having vfio device migration. Do you already have a migration capable
device to use with this?

> Would it be better to do more iterative steps (when possible) as opposed to
> scratch and rebuild VFIO type1 IOMMU handling?

Possibly, but every thing that gets added has to be carried over to
the new code too, and energy has to be expended trying to figure out
how the half implemented stuff should work while finishing it.

At the very least we must decide what to do with device-provided dirty
tracking before the VFIO type1 stuff can be altered to use the system
IOMMU.

This is very much like the migration FSM, the only appeal is the
existing qemu implementation of the protocol.

Jason
Jason Gunthorpe Feb. 18, 2022, 4:37 p.m. UTC | #20
On Fri, Jul 02, 2021 at 10:58:45AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This series attempts to add vfio live migration support for
> HiSilicon ACC VF devices. HiSilicon ACC VF device MMIO space
> includes both the functional register space and migration 
> control register space. As discussed in RFCv1[0], this may create
> security issues as these regions get shared between the Guest
> driver and the migration driver. Based on the feedback, we tried
> to address those concerns in this version. 
> 
> This is now based on the new vfio-pci-core framework proposal[1].
> Understand that the framework proposal is still under discussion,
> but really appreciate any feedback on the approach taken here
> to mitigate the security risks.

Do you want to try to get this into this kernel along with the mlx5
driver? How RFCy is this?

If you can post a v2 with the changes discussed soon I think there is
a good chance of this.

Thanks,
Jason
Joao Martins Feb. 22, 2022, 11:55 a.m. UTC | #21
On 2/15/22 16:21, Jason Gunthorpe wrote:
> On Tue, Feb 15, 2022 at 04:00:35PM +0000, Joao Martins wrote:
>> On 2/14/22 14:06, Jason Gunthorpe wrote:
>>> On Mon, Feb 14, 2022 at 01:34:15PM +0000, Joao Martins wrote:
>>>
>>>> [*] apparently we need to write an invalid entry first, invalidate the {IO}TLB
>>>> and then write the new valid entry. Not sure I understood correctly that this
>>>> is the 'break-before-make' thingie.
>>>
>>> Doesn't that explode if the invalid entry is DMA'd to?
>>>
>> Yes, IIUC. Also, the manual has this note:
> 
> Heh, sounds like "this doesn't work" to me :)
> 
Yeah, but I remember reading in manual that HTTUD (what ARM calls it for dirty
tracking, albeit DBM is another term for the same thing) requires FEAT_BBM
which avoids us to play the above games. So, supposedly, we can "just"
use atomics with IOPTE changes and IOTLB flush. Not if we need the latter
flush before or after on smmuv3.

>>> Like I said, I'd prefer we not build more on the VFIO type 1 code
>>> until we have a conclusion for iommufd..
>>>
>>
>> I didn't quite understand what you mean by conclusion.
> 
> If people are dead-set against doing iommufd, then lets abandon the
> idea and go back to hacking up vfio.
>  
Heh, I was under the impression everybody was investing so much *because*
that direction was set onto iommufd direction.

>> If by conclusion you mean the whole thing to be merged, how can the work be
>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>> in terms of direction...
> 
> I think go ahead and build it on top of iommufd, start working out the
> API details, etc. I think once the direction is concluded the new APIs
> will go forward.
>
/me nods, will do. Looking at your repository it is looking good.

>>> While returning the dirty data looks straight forward, it is hard to
>>> see an obvious path to enabling and controlling the system iommu the
>>> way vfio is now.
>>
>> It seems strange to have a whole UAPI for userspace [*] meant to
>> return dirty data to userspace, when dirty right now means the whole
>> pinned page set and so copying the whole guest ... 
> 
> Yes, the whole thing is only partially implemented, and doesn't have
> any in-kernel user. It is another place holder for an implementation
> to come someday.
> 
Yeap, seems like.

>> Hence my thinking was that the patches /if small/ would let us see how dirty
>> tracking might work for iommu kAPI (and iommufd) too.
> 
> It could be tried, but I think if you go into there you will find it
> quickly turns quite complicated to address all the edge cases. Eg what
> do you do if you have a mdev present after you turn on system
> tracking? What if the mdev is using a PASID?
> What about hotplug of new
> VFIO devices?
> 
> Remember, dirty tracking for vfio is totally useless without also
> having vfio device migration. 

Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
if we can't do the device part first. But if quiescing DMA and saving
state are two hard requirements that are mandatory for a live migrateable
VF, having dirty tracking in the devices I suspect might be more rare.
So perhaps people will look at IOMMUs as a commodity-workaround to avoid
a whole bunch of hardware logic for dirty tracking, even bearing what it
entails for DMA performance (hisilicon might be an example).

> Do you already have a migration capable
> device to use with this?
> 
Not yet, but soon I hope.

>> Would it be better to do more iterative steps (when possible) as opposed to
>> scratch and rebuild VFIO type1 IOMMU handling?
> 
> Possibly, but every thing that gets added has to be carried over to
> the new code too, and energy has to be expended trying to figure out
> how the half implemented stuff should work while finishing it.
> 
/me nods I understand

> At the very least we must decide what to do with device-provided dirty
> tracking before the VFIO type1 stuff can be altered to use the system
> IOMMU.
> 
I, too, have been wondering what that is going to look like -- and how do we
convey the setup of dirty tracking versus the steering of it.

> This is very much like the migration FSM, the only appeal is the
> existing qemu implementation of the protocol.

Yeah.
Jason Gunthorpe Feb. 23, 2022, 1:03 a.m. UTC | #22
On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:

> > If people are dead-set against doing iommufd, then lets abandon the
> > idea and go back to hacking up vfio.
> >  
> Heh, I was under the impression everybody was investing so much *because*
> that direction was set onto iommufd direction.

Such is the hope :)

> >> If by conclusion you mean the whole thing to be merged, how can the work be
> >> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >> in terms of direction...
> > 
> > I think go ahead and build it on top of iommufd, start working out the
> > API details, etc. I think once the direction is concluded the new APIs
> > will go forward.
> >
> /me nods, will do. Looking at your repository it is looking good.

I would like to come with some plan for dirty tracking on iommufd and
combine that with a plan for dirty tracking inside the new migration
drivers.

> Oh yes -- I am definitely aware. IOMMU/Device Dirty tracking is useless
> if we can't do the device part first. But if quiescing DMA and saving
> state are two hard requirements that are mandatory for a live migrateable
> VF, having dirty tracking in the devices I suspect might be more
> rare.

So far all but one of the live migration devices I know about can do
dirty tracking internally..

> So perhaps people will look at IOMMUs as a commodity-workaround to avoid
> a whole bunch of hardware logic for dirty tracking, even bearing what it
> entails for DMA performance (hisilicon might be an example).

I do expect this will be true.

> > At the very least we must decide what to do with device-provided dirty
> > tracking before the VFIO type1 stuff can be altered to use the system
> > IOMMU.
> 
> I, too, have been wondering what that is going to look like -- and how do we
> convey the setup of dirty tracking versus the steering of it.

What I suggested was to just split them.

Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
would be on a per-domain basis, not on the ioas.

Some ioctl toward the vfio device will turn on the device's tracker.

Userspace has to iterate and query each enabled tracker. This is not
so bad because the time to make the system call is going to be tiny
compared to the time to marshal XXGB of dirty bits.

This makes qemu more complicated because it has to decide what
trackers to turn on, but that is also the point because we do want
userspace to be able to decide.

The other idea that has some possible interest is to allow the
trackers to dump their dirty bits into the existing kvm tracker, then
userspace just does a single kvm centric dirty pass.

Jason
Joao Martins Feb. 25, 2022, 7:18 p.m. UTC | #23
On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
> 
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
> 
I had a few things going on my end over the past weeks, albeit it is
getting a bit better now and I will be coming back to this topic. I hope/want
to give you a more concrete update/feedback over the coming week or two wrt
to dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect overall iommufd
design. The main thing is really lookups to get vendor iopte, upon on what might
be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling the tracking,
I'll be simplifying the interface in the other type1 series I had and making it
a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
enable/disable over a domain. Perhaps same trick could be expanded to other
features to have a bit more control on what userspace is allowed to use. I think
this just needs to set/clear a feature bit in the domain, for VFIO or userspace
to have full control during the different stages of migration of dirty tracking.
In all of the IOMMU implementations/manuals I read it means setting a protection
domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
(albeit past work had also it always-on).

Provided the iommufd does /separately/ more finer granularity on what we can
do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
at will as separate operations, before and after migration respectivally. That logic
would probably be better to be in separate iommufd ioctls(), as that it's going to be
expensive.


>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
> 
> What I suggested was to just split them.
> 
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
> 
> Some ioctl toward the vfio device will turn on the device's tracker.
> 
In the activation/fetching-data of either trackers I see some things in common in
terms of UAPI with the difference that whether a device or a list of devices are passed on
as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
your suggestion)

Albeit perhaps the main difference is going to be that one needs to setup with
hardware interface with the device tracker and how we carry the regions of memory that
want to be tracked i.e. GPA/IOVA ranges that the device should track. The tracking-GPA
space is not linear GPA space sadly. But at the same point perhaps the internal VFIO API
between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped.

In IOMMU this is sort of cheap and 'stateless', but on the setup of the
device tracker might mean giving all the IOVA ranges to the PF (once?).
Perhaps leaving to the vendor driver to pick when to register the IOVA space to
be tracked, or perhaps when you turn on the device's tracker. But on all cases,
the driver needs some form of awareness of and convey that to the PF for
tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
> 
Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
following by a smaller cost copying back to userspace (which KVM does too, when it's using
a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
(gup as you mentioned earlier in the thread), or just copying based on the input bitmap
(from PF) number of leading zeroes within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
> 
If the interface wants extending to pass a device or an array of devices (if I understood
you correctly), it would free/simplify VFIO from having to concatenate potentially
different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
more to avoid performing too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably limit certain more modern options of ring based dirty tracking,
as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
would require KVM to always use a bitmap during migration/dirty-rate-estimation with
the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
it's a module).

Although, perhaps IOMMU sharing pgtables with CPUs would probably better align with
reusing KVM existing dirty bitmap interface. But I haven't given much thought on that one.
I just remember reading something on the IOMMU manual about this (ISTR that iommu_v2 was
made for that purpose).
Jason Gunthorpe Feb. 25, 2022, 8:44 p.m. UTC | #24
On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> On 2/23/22 01:03, Jason Gunthorpe wrote:
> > On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>> in terms of direction...
> >>>
> >>> I think go ahead and build it on top of iommufd, start working out the
> >>> API details, etc. I think once the direction is concluded the new APIs
> >>> will go forward.
> >>>
> >> /me nods, will do. Looking at your repository it is looking good.
> > 
> > I would like to come with some plan for dirty tracking on iommufd and
> > combine that with a plan for dirty tracking inside the new migration
> > drivers.
> > 
> I had a few things going on my end over the past weeks, albeit it is
> getting a bit better now and I will be coming back to this topic. I hope/want
> to give you a more concrete update/feedback over the coming week or two wrt
> to dirty-tracking+iommufd+amd.
> 
> So far, I am not particularly concerned that this will affect overall iommufd
> design. The main thing is really lookups to get vendor iopte, upon on what might
> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> the tracking,

I'm not very keen on these multiplexer interfaces. I think you should
just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
'read_dirty_bits'

NULL op means not supported.

IMHO we don't need a kapi wrapper if only iommufd is going to call the
op.

> I'll be simplifying the interface in the other type1 series I had and making it
> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> enable/disable over a domain. Perhaps same trick could be expanded to other
> features to have a bit more control on what userspace is allowed to use. I think
> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> to have full control during the different stages of migration of dirty tracking.
> In all of the IOMMU implementations/manuals I read it means setting a protection
> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> (albeit past work had also it always-on).
> 
> Provided the iommufd does /separately/ more finer granularity on what we can
> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> at will as separate operations, before and after migration respectivally. That logic
> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> expensive.

This all sounds right to me

Questions I have:
 - Do we need ranges for some reason? You mentioned ARM SMMU wants
   ranges? how/what/why?

 - What about the unmap and read dirty without races operation that
   vfio has?

> >> I, too, have been wondering what that is going to look like -- and how do we
> >> convey the setup of dirty tracking versus the steering of it.
> > 
> > What I suggested was to just split them.
> > 
> > Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> > would be on a per-domain basis, not on the ioas.
> > 
> > Some ioctl toward the vfio device will turn on the device's tracker.
> > 
> In the activation/fetching-data of either trackers I see some things in common in
> terms of UAPI with the difference that whether a device or a list of devices are passed on
> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
> your suggestion)

I was thinking a VFIO_DEVICE ioctl located on the device FD
implemented in the end VFIO driver (like mlx5_vfio). No lists..

As you say the driver should just take in the request to set dirty
tracking and take core of it somehow. There is no value the core VFIO
code can add here.

> Albeit perhaps the main difference is going to be that one needs to
> setup with hardware interface with the device tracker and how we
> carry the regions of memory that want to be tracked i.e. GPA/IOVA
> ranges that the device should track. The tracking-GPA space is not
> linear GPA space sadly. But at the same point perhaps the internal
> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
> ranges we mapped.

Yes, this is a point that needs some answering. One option is to pass
in the tracking range list from userspace. Another is to query it in
the driver from the currently mapped areas in IOAS.

I know devices have limitations here in terms of how many/how big the
ranges can be, and devices probably can't track dynamic changes.

> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
> device tracker might mean giving all the IOVA ranges to the PF (once?).
> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
> the driver needs some form of awareness of and convey that to the PF for
> tracking purposes.

Yes, this is right
 
> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
> following by a smaller cost copying back to userspace (which KVM does too, when it's using
> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
> (from PF) number of leading zeroes within some threshold.

What I would probably strive for is an API that deliberately OR's in
the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
in your dirty data', then do the next page. etc. That is 134M of IOVA
per chunk which seems OK.

> > This makes qemu more complicated because it has to decide what
> > trackers to turn on, but that is also the point because we do want
> > userspace to be able to decide.
> > 
> If the interface wants extending to pass a device or an array of devices (if I understood
> you correctly), it would free/simplify VFIO from having to concatenate potentially
> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
> more to avoid performing too much copying.

Yes. Currently VFIO maintains its own bitmap so it also saves that
memory by keeping the dirty bits stored in the IOPTEs until read out.
 
> > The other idea that has some possible interest is to allow the
> > trackers to dump their dirty bits into the existing kvm tracker, then
> > userspace just does a single kvm centric dirty pass.
> 
> That would probably limit certain more modern options of ring based dirty tracking,
> as that kvm dirty bitmap is mutually exclusive with kvm dirty ring. Or at least,
> would require KVM to always use a bitmap during migration/dirty-rate-estimation with
> the presence of vfio/vdpa devices. It's a nice idea, though. It would require making
> core-iommu aware of bitmap as external storage for tracking (that is not iommufd as
> it's a module).

Yes, I don't know enough about kvm to say if that is a great idea or
not. The fact the CPUs seem to be going to logging instead of bitmaps
suggests it isn't. I don't think DMA devices can work effectively with
logging..

Jason
Joao Martins Feb. 28, 2022, 1:01 p.m. UTC | #25
On 2/25/22 20:44, Jason Gunthorpe wrote:
> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>> in terms of direction...
>>>>>
>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>> will go forward.
>>>>>
>>>> /me nods, will do. Looking at your repository it is looking good.
>>>
>>> I would like to come with some plan for dirty tracking on iommufd and
>>> combine that with a plan for dirty tracking inside the new migration
>>> drivers.
>>>
>> I had a few things going on my end over the past weeks, albeit it is
>> getting a bit better now and I will be coming back to this topic. I hope/want
>> to give you a more concrete update/feedback over the coming week or two wrt
>> to dirty-tracking+iommufd+amd.
>>
>> So far, I am not particularly concerned that this will affect overall iommufd
>> design. The main thing is really lookups to get vendor iopte, upon on what might
>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>> the tracking,
> 
> I'm not very keen on these multiplexer interfaces. I think you should
> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> 'read_dirty_bits'
> 
> NULL op means not supported.
> 
> IMHO we don't need a kapi wrapper if only iommufd is going to call the
> op.
> 

OK, good point.

Even without a kapi wrapper I am still wondering whether the iommu op needs to
be something like a generic iommu feature toggling (e.g. .set_feature()), rather
than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
the only feature we will change out-of-band in the protection-domain.
I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
because the sync() doesn't only read, but also "writes" to root pagetable to update
the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

And TBH, the question on whether we need a clear op isn't immediately obvious: reading
the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
get_dirty_bitmap() and clear_dirty_ditmap().

Hopefully a futuristic IOMMUs would just get a new DTE/CD/etc field for stashing a set of
PFNs (for a bitmap) that the IOMMU would use for setting the dirty bits there. That is,
rather than forcing sw to split/merge pagetables for better granularity and having to
flush IOTLBs for dirty to be written again :(

>> I'll be simplifying the interface in the other type1 series I had and making it
>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>> enable/disable over a domain. Perhaps same trick could be expanded to other
>> features to have a bit more control on what userspace is allowed to use. I think
>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>> to have full control during the different stages of migration of dirty tracking.
>> In all of the IOMMU implementations/manuals I read it means setting a protection
>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>> (albeit past work had also it always-on).
>>
>> Provided the iommufd does /separately/ more finer granularity on what we can
>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>> at will as separate operations, before and after migration respectivally. That logic
>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>> expensive.
> 
> This all sounds right to me
> 
> Questions I have:
>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>    ranges? how/what/why?
> 
Ignore that. I got mislead by the implementation and when I read the SDM
I realized that the implementation was doing the same thing I was doing
i.e. enabling dirty-bit in the protection domain at start rather than
dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
bit in the context descriptor to enable dirty bits or the STE.S2HD in the
stream table entry for the stage2 equivalent. Nothing here is per-range
basis. And the ranges was only used by the implementation for the eager
splitting/merging of IO page table levels.

>  - What about the unmap and read dirty without races operation that
>    vfio has?
> 
I am afraid that might need a new unmap iommu op that reads the dirty bit
after clearing the page table entry. It's marshalling the bits from
iopte into a bitmap as opposed to some logic added on top. As far as I
looked for AMD this isn't difficult to add, (same for Intel) it can
*I think* reuse all of the unmap code.

>>>> I, too, have been wondering what that is going to look like -- and how do we
>>>> convey the setup of dirty tracking versus the steering of it.
>>>
>>> What I suggested was to just split them.
>>>
>>> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
>>> would be on a per-domain basis, not on the ioas.
>>>
>>> Some ioctl toward the vfio device will turn on the device's tracker.
>>>
>> In the activation/fetching-data of either trackers I see some things in common in
>> terms of UAPI with the difference that whether a device or a list of devices are passed on
>> as an argument of exiting dirty-track vfio ioctls(). (At least that's how I am reading
>> your suggestion)
> 
> I was thinking a VFIO_DEVICE ioctl located on the device FD
> implemented in the end VFIO driver (like mlx5_vfio). No lists..
> 
Interesting. I was assuming that we wanted to keep the same ioctl()
interface for dirty-tracking hence me mentioning the device/device-list on
top of the DIRTY ioctl. But well on a second thought it doesn't make sense
given that we query a container and we want to move away from vfio for iopgtbl
related-work and rather into iommufd direct access instead by the VMM. So
placing more dependency on that ioctl wouldn't align with that. I suppose, it
makes sense that this is on a per-vfio-device basis, hence device-vfio ioctl().

> As you say the driver should just take in the request to set dirty
> tracking and take core of it somehow. There is no value the core VFIO
> code can add here.
> 
>> Albeit perhaps the main difference is going to be that one needs to
>> setup with hardware interface with the device tracker and how we
>> carry the regions of memory that want to be tracked i.e. GPA/IOVA
>> ranges that the device should track. The tracking-GPA space is not
>> linear GPA space sadly. But at the same point perhaps the internal
>> VFIO API between core-VFIO and vendor-VFIO is just reading the @dma
>> ranges we mapped.
> 
> Yes, this is a point that needs some answering. One option is to pass
> in the tracking range list from userspace. Another is to query it in
> the driver from the currently mapped areas in IOAS.
> 
I sort of like the second given that we de-duplicate info that is already
tracked by IOAS -- it would be mostly internal and then it would be a
matter of when does this device tracker is set up, and whether we should
differentiate tracker "start"/"stop" vs "setup"/"teardown".

> I know devices have limitations here in terms of how many/how big the
> ranges can be, and devices probably can't track dynamic changes.
> 
/me nods

Should this be some sort of capability perhaps? Given that this may limit
how many concurrent VFs can be migrated and how much ranges it can store,
for example.

(interestingly and speaking of VF capabilities, it's yet another thing to
tackle in the migration stream between src and dst hosts)

>> In IOMMU this is sort of cheap and 'stateless', but on the setup of the
>> device tracker might mean giving all the IOVA ranges to the PF (once?).
>> Perhaps leaving to the vendor driver to pick when to register the IOVA space to
>> be tracked, or perhaps when you turn on the device's tracker. But on all cases,
>> the driver needs some form of awareness of and convey that to the PF for
>> tracking purposes.
> 
> Yes, this is right
>  
>> Yeap. The high cost is scanning vendor-iommu ioptes and marshaling to a bitmap,
>> following by a smaller cost copying back to userspace (which KVM does too, when it's using
>> a bitmap, same as VFIO today). Maybe could be optimized to either avoid the copy
>> (gup as you mentioned earlier in the thread), or just copying based on the input bitmap
>> (from PF) number of leading zeroes within some threshold.
> 
> What I would probably strive for is an API that deliberately OR's in
> the dirty bits. So GUP and kmap a 4k page then call the driver to 'or
> in your dirty data', then do the next page. etc. That is 134M of IOVA
> per chunk which seems OK.
> 
Yeap, sort of what I was aiming for.

>>> This makes qemu more complicated because it has to decide what
>>> trackers to turn on, but that is also the point because we do want
>>> userspace to be able to decide.
>>>
>> If the interface wants extending to pass a device or an array of devices (if I understood
>> you correctly), it would free/simplify VFIO from having to concatenate potentially
>> different devices bitmaps into one. Albeit would require optimizing the marshalling a bit
>> more to avoid performing too much copying.
> 
> Yes. Currently VFIO maintains its own bitmap so it also saves that
> memory by keeping the dirty bits stored in the IOPTEs until read out.
>  
/me nods with the iommu, yes.
Jason Gunthorpe Feb. 28, 2022, 9:01 p.m. UTC | #26
On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
> > On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
> >>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
> >>>>>> in terms of direction...
> >>>>>
> >>>>> I think go ahead and build it on top of iommufd, start working out the
> >>>>> API details, etc. I think once the direction is concluded the new APIs
> >>>>> will go forward.
> >>>>>
> >>>> /me nods, will do. Looking at your repository it is looking good.
> >>>
> >>> I would like to come with some plan for dirty tracking on iommufd and
> >>> combine that with a plan for dirty tracking inside the new migration
> >>> drivers.
> >>>
> >> I had a few things going on my end over the past weeks, albeit it is
> >> getting a bit better now and I will be coming back to this topic. I hope/want
> >> to give you a more concrete update/feedback over the coming week or two wrt
> >> to dirty-tracking+iommufd+amd.
> >>
> >> So far, I am not particularly concerned that this will affect overall iommufd
> >> design. The main thing is really lookups to get vendor iopte, upon on what might
> >> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
> >> the tracking,
> > 
> > I'm not very keen on these multiplexer interfaces. I think you should
> > just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
> > 'read_dirty_bits'
> > 
> > NULL op means not supported.
> > 
> > IMHO we don't need a kapi wrapper if only iommufd is going to call the
> > op.
> > 
> 
> OK, good point.
> 
> Even without a kapi wrapper I am still wondering whether the iommu op needs to
> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
> the only feature we will change out-of-band in the protection-domain.
> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).

I just generally dislike multiplexers like this. We are already
calling through a function pointer struct, why should the driver
implement another switch/case just to find out which function pointer
the caller really ment to use? It doesn't make things faster, it
doesn't make things smaller, it doesn't use less LOC. Why do it?

> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
> because the sync() doesn't only read, but also "writes" to root pagetable to update
> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.

'read and clear' is what I'd suggest

> And TBH, the question on whether we need a clear op isn't immediately obvious: reading
> the access/dirty bit is cheap for the IOMMU, the problem OTOH is the expensive
> io page table walk thus expensive in sw. The clear-dirty part, though, is precise on what
> it wants to clear (in principle cheaper on io-page-table walk as you just iterate over
> sets of bits to clear) but then it incurs a DMA perf hit given that we need to flush the
> IOTLBs. Given the IOTLB flush is batched (over a course of a dirty updates) perhaps this
> isn't immediately clear that is a problem in terms of total overall ioctl cost. Hence my
> thinking in merging both in one sync_dirty_bitmap() as opposed to more KVM-style of
> get_dirty_bitmap() and clear_dirty_ditmap().

Yes, I wouldn't split them.

> > Questions I have:
> >  - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >    ranges? how/what/why?
> > 
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was
> doing

Ok

> >  - What about the unmap and read dirty without races operation that
> >    vfio has?
> > 
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.

Ok. It feels necessary to be complete

> > Yes, this is a point that needs some answering. One option is to pass
> > in the tracking range list from userspace. Another is to query it in
> > the driver from the currently mapped areas in IOAS.
> > 
> I sort of like the second given that we de-duplicate info that is already
> tracked by IOAS -- it would be mostly internal and then it would be a
> matter of when does this device tracker is set up, and whether we should
> differentiate tracker "start"/"stop" vs "setup"/"teardown".

One problem with this is that devices that don't support dynamic
tracking are stuck in vIOMMU cases where the IOAS will have some tiny
set of all memory mapped. 

So I'd probably say qemu should forward the entire guest CPU memory
space, as well as any future memory hotplug area, as ranges and not
rely on the IOAS already mapping anything.

> > I know devices have limitations here in terms of how many/how big the
> > ranges can be, and devices probably can't track dynamic changes.
> > 
> /me nods
> 
> Should this be some sort of capability perhaps? Given that this may limit
> how many concurrent VFs can be migrated and how much ranges it can store,
> for example.
> 
> (interestingly and speaking of VF capabilities, it's yet another thing to
> tackle in the migration stream between src and dst hosts)

I don't know what to do with these limitations right now..

Jason
Joao Martins March 1, 2022, 1:06 p.m. UTC | #27
On 2/28/22 21:01, Jason Gunthorpe wrote:
> On Mon, Feb 28, 2022 at 01:01:39PM +0000, Joao Martins wrote:
>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>>>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>>>>>> in terms of direction...
>>>>>>>
>>>>>>> I think go ahead and build it on top of iommufd, start working out the
>>>>>>> API details, etc. I think once the direction is concluded the new APIs
>>>>>>> will go forward.
>>>>>>>
>>>>>> /me nods, will do. Looking at your repository it is looking good.
>>>>>
>>>>> I would like to come with some plan for dirty tracking on iommufd and
>>>>> combine that with a plan for dirty tracking inside the new migration
>>>>> drivers.
>>>>>
>>>> I had a few things going on my end over the past weeks, albeit it is
>>>> getting a bit better now and I will be coming back to this topic. I hope/want
>>>> to give you a more concrete update/feedback over the coming week or two wrt
>>>> to dirty-tracking+iommufd+amd.
>>>>
>>>> So far, I am not particularly concerned that this will affect overall iommufd
>>>> design. The main thing is really lookups to get vendor iopte, upon on what might
>>>> be a iommu_sync_dirty_bitmap(domain, iova, size) API. For toggling
>>>> the tracking,
>>>
>>> I'm not very keen on these multiplexer interfaces. I think you should
>>> just add a new ops to the new iommu_domain_ops 'set_dirty_tracking'
>>> 'read_dirty_bits'
>>>
>>> NULL op means not supported.
>>>
>>> IMHO we don't need a kapi wrapper if only iommufd is going to call the
>>> op.
>>>
>>
>> OK, good point.
>>
>> Even without a kapi wrapper I am still wondering whether the iommu op needs to
>> be something like a generic iommu feature toggling (e.g. .set_feature()), rather
>> than one that sits "hardcoded" as set_dirty(). Unless dirty right now is about
>> the only feature we will change out-of-band in the protection-domain.
>> I guess I can stay with set_ad_tracking/set_dirty_tracking and if should
>> need arise we will expand with a generic .set_feature(dom, IOMMU_DIRTY | IOMMU_ACCESS).
> 
> I just generally dislike multiplexers like this. We are already
> calling through a function pointer struct, why should the driver
> implement another switch/case just to find out which function pointer
> the caller really ment to use? It doesn't make things faster, it
> doesn't make things smaller, it doesn't use less LOC. Why do it?
> 
I concur with you in the above but I don't mean it like a multiplexer.
Rather, mimicking the general nature of feature bits in the protection domain
(or the hw pagetable abstraction). So hypothetically for every bit ... if you
wanted to create yet another op that just flips a bit of the
DTEs/PASID-table/CD it would be an excessive use of callbacks to get
to in the iommu_domain_ops if they all set do the same thing.
Right now it's only Dirty (Access bit I don't see immediate use for it
right now) bits, but could there be more? Perhaps this is something to
think about.

Anyways, I am anyways sticking with plain ops function.

>> Regarding the dirty 'data' that's one that I am wondering about. I called it 'sync'
>> because the sync() doesn't only read, but also "writes" to root pagetable to update
>> the dirty bit (and then IOTLB flush). And that's about what VFIO current interface
>> does (i.e. there's only a GET_BITMAP in the ioctl) and no explicit interface to clear.
> 
> 'read and clear' is what I'd suggest
> 
/me nods


>>>  - What about the unmap and read dirty without races operation that
>>>    vfio has?
>>>
>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>> after clearing the page table entry. It's marshalling the bits from
>> iopte into a bitmap as opposed to some logic added on top. As far as I
>> looked for AMD this isn't difficult to add, (same for Intel) it can
>> *I think* reuse all of the unmap code.
> 
> Ok. It feels necessary to be complete
> 
Yes, I guess it's mandatory if we want to fully implement vfio-compat dirty
tracking IOCTLs, too.

>>> Yes, this is a point that needs some answering. One option is to pass
>>> in the tracking range list from userspace. Another is to query it in
>>> the driver from the currently mapped areas in IOAS.
>>>
>> I sort of like the second given that we de-duplicate info that is already
>> tracked by IOAS -- it would be mostly internal and then it would be a
>> matter of when does this device tracker is set up, and whether we should
>> differentiate tracker "start"/"stop" vs "setup"/"teardown".
> 
> One problem with this is that devices that don't support dynamic
> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> set of all memory mapped. 
> 
Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
that you have no limitation of ranges and fw/hw can cope with being fed of
'new-ranges' to enable dirty-tracking. Or does it mean something else at
hw level? Like dirty bitmap being pulled out of device memory, and hw keeps
its own state of dirtying and sw 'just' needs to sync the bitmap every scan.
Jason Gunthorpe March 1, 2022, 1:54 p.m. UTC | #28
On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:

> I concur with you in the above but I don't mean it like a multiplexer.
> Rather, mimicking the general nature of feature bits in the protection domain
> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
> wanted to create yet another op that just flips a bit of the
> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
> to in the iommu_domain_ops if they all set do the same thing.
> Right now it's only Dirty (Access bit I don't see immediate use for it
> right now) bits, but could there be more? Perhaps this is something to
> think about.

Not sure it matters :)

> > One problem with this is that devices that don't support dynamic
> > tracking are stuck in vIOMMU cases where the IOAS will have some tiny
> > set of all memory mapped. 
> > 
> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
> that you have no limitation of ranges and fw/hw can cope with being fed of
> 'new-ranges' to enable dirty-tracking. 

Yes, the ranges can change once the tracking starts, like the normal
IOMMU can do

We are looking at devices where the HW can track a range at the start
of tracking and that range cannot change over time.

So, we need to track the whole guest memory and some extra for memory
hot plug, not just the currently viommu mapped things.

Jason
Joao Martins March 1, 2022, 2:27 p.m. UTC | #29
On 3/1/22 13:54, Jason Gunthorpe wrote:
> On Tue, Mar 01, 2022 at 01:06:30PM +0000, Joao Martins wrote:
> 
>> I concur with you in the above but I don't mean it like a multiplexer.
>> Rather, mimicking the general nature of feature bits in the protection domain
>> (or the hw pagetable abstraction). So hypothetically for every bit ... if you
>> wanted to create yet another op that just flips a bit of the
>> DTEs/PASID-table/CD it would be an excessive use of callbacks to get
>> to in the iommu_domain_ops if they all set do the same thing.
>> Right now it's only Dirty (Access bit I don't see immediate use for it
>> right now) bits, but could there be more? Perhaps this is something to
>> think about.
> 
> Not sure it matters :)
> 
Hehe, most likely it doesn't :)

>>> One problem with this is that devices that don't support dynamic
>>> tracking are stuck in vIOMMU cases where the IOAS will have some tiny
>>> set of all memory mapped. 
>>>
>> Sorry to be pedantic, when you say 'dynamic tracking' for you it just means
>> that you have no limitation of ranges and fw/hw can cope with being fed of
>> 'new-ranges' to enable dirty-tracking. 
> 
> Yes, the ranges can change once the tracking starts, like the normal
> IOMMU can do
> 
> We are looking at devices where the HW can track a range at the start
> of tracking and that range cannot change over time.
> 
> So, we need to track the whole guest memory and some extra for memory
> hot plug, not just the currently viommu mapped things.

I'm aware of the guest memmap accounting foo, so I was already factoring
that in the equation. I was just mainly grokking at hardware limitations
you are coming from to be captured in these future VFIO extensions.

Tracking the max possible GPA (as you hinted earlier) could solve the two natures
of write-tracking, also given that the IOMMU ultimately protects what's really
allowed to be DMA-written to. Unless such tracking-bitmap is placed on device
memory, where space probably needs to be more carefully managed.

Anyways, thanks for the insights!
Joao Martins March 11, 2022, 1:51 p.m. UTC | #30
On 2/28/22 13:01, Joao Martins wrote:
> On 2/25/22 20:44, Jason Gunthorpe wrote:
>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>> I'll be simplifying the interface in the other type1 series I had and making it
>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>> features to have a bit more control on what userspace is allowed to use. I think
>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>> to have full control during the different stages of migration of dirty tracking.
>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>> (albeit past work had also it always-on).
>>>
>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>> at will as separate operations, before and after migration respectivally. That logic
>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>> expensive.
>>
>> This all sounds right to me
>>
>> Questions I have:
>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>    ranges? how/what/why?
>>
> Ignore that. I got mislead by the implementation and when I read the SDM
> I realized that the implementation was doing the same thing I was doing
> i.e. enabling dirty-bit in the protection domain at start rather than
> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> stream table entry for the stage2 equivalent. Nothing here is per-range
> basis. And the ranges was only used by the implementation for the eager
> splitting/merging of IO page table levels.
> 
>>  - What about the unmap and read dirty without races operation that
>>    vfio has?
>>
> I am afraid that might need a new unmap iommu op that reads the dirty bit
> after clearing the page table entry. It's marshalling the bits from
> iopte into a bitmap as opposed to some logic added on top. As far as I
> looked for AMD this isn't difficult to add, (same for Intel) it can
> *I think* reuse all of the unmap code.
> 

OK, made some progress.

It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
revising locking, bugs, and more -- works with my emulated qemu patches which
is a good sign. But hopefully starts some sort of skeleton of what we were
talking about in the above thread.

The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
the vfio-compat one as it was easier provided there's existing userspace to work
with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
steering the dirty tracking, read-clear the dirty bits, unmap and get dirty. But
as we discussed earlier, the one gap of VFIO dirty API is that it lacks controls
for upgrading/downgrading area/IOPTE sizes which is where IOMMUFD would most
likely shine. When that latter part is done we can probably adopt an eager-split
approach inside vfio-compat.

Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
Some of the commits made notes of some of research I made, so *I think* the APIs
introduced capture h/w semantics for all the three IOMMUs supporting dirty
tracking.

I am storing my dirty-tracking bits here:

	https://github.com/jpemartins/linux	iommufd

It's a version of it rebased on top of your iommufd branch.
Jason Gunthorpe March 15, 2022, 7:29 p.m. UTC | #31
On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
> On 2/28/22 13:01, Joao Martins wrote:
> > On 2/25/22 20:44, Jason Gunthorpe wrote:
> >> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
> >>> On 2/23/22 01:03, Jason Gunthorpe wrote:
> >>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
> >>> I'll be simplifying the interface in the other type1 series I had and making it
> >>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
> >>> enable/disable over a domain. Perhaps same trick could be expanded to other
> >>> features to have a bit more control on what userspace is allowed to use. I think
> >>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
> >>> to have full control during the different stages of migration of dirty tracking.
> >>> In all of the IOMMU implementations/manuals I read it means setting a protection
> >>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
> >>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
> >>> (albeit past work had also it always-on).
> >>>
> >>> Provided the iommufd does /separately/ more finer granularity on what we can
> >>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
> >>> at will as separate operations, before and after migration respectivally. That logic
> >>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
> >>> expensive.
> >>
> >> This all sounds right to me
> >>
> >> Questions I have:
> >>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
> >>    ranges? how/what/why?
> >>
> > Ignore that. I got mislead by the implementation and when I read the SDM
> > I realized that the implementation was doing the same thing I was doing
> > i.e. enabling dirty-bit in the protection domain at start rather than
> > dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
> > bit in the context descriptor to enable dirty bits or the STE.S2HD in the
> > stream table entry for the stage2 equivalent. Nothing here is per-range
> > basis. And the ranges was only used by the implementation for the eager
> > splitting/merging of IO page table levels.
> > 
> >>  - What about the unmap and read dirty without races operation that
> >>    vfio has?
> >>
> > I am afraid that might need a new unmap iommu op that reads the dirty bit
> > after clearing the page table entry. It's marshalling the bits from
> > iopte into a bitmap as opposed to some logic added on top. As far as I
> > looked for AMD this isn't difficult to add, (same for Intel) it can
> > *I think* reuse all of the unmap code.
> > 
> 
> OK, made some progress.

Nice!

 
> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
> revising locking, bugs, and more -- works with my emulated qemu patches which
> is a good sign. But hopefully starts some sort of skeleton of what we were
> talking about in the above thread.

I'm a bit bogged with the coming merge window right now, but will have
more to say in a bit
 
> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
> the vfio-compat one as it was easier provided there's existing userspace to work
> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
> steering the dirty tracking, read-clear the dirty bits, unmap and
> get dirty. 

I think this is fine to start experimenting with

> But as we discussed earlier, the one gap of VFIO dirty API is that
> it lacks controls for upgrading/downgrading area/IOPTE sizes which
> is where IOMMUFD would most likely shine. When that latter part is
> done we can probably adopt an eager-split approach inside
> vfio-compat.

I think the native API should be new ioctls that operate on the
hw_pagetable object to:
  - enable/disable dirty tracking 
  - read&clear a bitmap from a range
  - read&unmap a bitmap from a range
  - Manipulate IOPTE sizes

As you say it should not be much distance from the VFIO compat stuff

Most probably I would say to leave dirty tracking out of the type1 api
and compat for it. Maybe we can make some limited cases work back
compat, like the whole ioas supports iommu dirty tracking or
something..

Need to understand if it is wortwhile - remember to use any of this
you need a qemu that is updated to the v2 migration interface,
so there is little practical value in back compat to old qemu if we
expect qemu will use the native interface anyhow.

> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
> Some of the commits made notes of some of research I made, so *I think* the APIs
> introduced capture h/w semantics for all the three IOMMUs supporting dirty
> tracking.

I think the primitives are pretty basic, I'd be surprised if there is
something different :)

Though things to be thinking about are how does this work with nested
translation and other advanced features.

Thanks,
Jason
Joao Martins March 16, 2022, 4:36 p.m. UTC | #32
On 3/15/22 19:29, Jason Gunthorpe wrote:
> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>> On 2/28/22 13:01, Joao Martins wrote:
>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> I'll be simplifying the interface in the other type1 series I had and making it
>>>>> a simple iommu_set_feature(domain, flag, value) behind an ioctl for iommufd that can
>>>>> enable/disable over a domain. Perhaps same trick could be expanded to other
>>>>> features to have a bit more control on what userspace is allowed to use. I think
>>>>> this just needs to set/clear a feature bit in the domain, for VFIO or userspace
>>>>> to have full control during the different stages of migration of dirty tracking.
>>>>> In all of the IOMMU implementations/manuals I read it means setting a protection
>>>>> domain descriptor flag: AMD is a 2-bit field in the DTE, on Intel likewise but on
>>>>> the PASID table only for scalable-mode PTEs, on SMMUv3.2 there's an equivalent
>>>>> (albeit past work had also it always-on).
>>>>>
>>>>> Provided the iommufd does /separately/ more finer granularity on what we can
>>>>> do with page tables. Thus the VMM can demote/promote the ioptes to a lower page size
>>>>> at will as separate operations, before and after migration respectivally. That logic
>>>>> would probably be better to be in separate iommufd ioctls(), as that it's going to be
>>>>> expensive.
>>>>
>>>> This all sounds right to me
>>>>
>>>> Questions I have:
>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>    ranges? how/what/why?
>>>>
>>> Ignore that. I got mislead by the implementation and when I read the SDM
>>> I realized that the implementation was doing the same thing I was doing
>>> i.e. enabling dirty-bit in the protection domain at start rather than
>>> dynamic toggling. So ARM is similar to Intel/AMD in which you set CD.HD
>>> bit in the context descriptor to enable dirty bits or the STE.S2HD in the
>>> stream table entry for the stage2 equivalent. Nothing here is per-range
>>> basis. And the ranges was only used by the implementation for the eager
>>> splitting/merging of IO page table levels.
>>>
>>>>  - What about the unmap and read dirty without races operation that
>>>>    vfio has?
>>>>
>>> I am afraid that might need a new unmap iommu op that reads the dirty bit
>>> after clearing the page table entry. It's marshalling the bits from
>>> iopte into a bitmap as opposed to some logic added on top. As far as I
>>> looked for AMD this isn't difficult to add, (same for Intel) it can
>>> *I think* reuse all of the unmap code.
>>>
>>
>> OK, made some progress.
> 
> Nice!
> 
>  
>> It's a WIP (here be dragons!) and still missing things e.g. iommufd selftests,
>> revising locking, bugs, and more -- works with my emulated qemu patches which
>> is a good sign. But hopefully starts some sort of skeleton of what we were
>> talking about in the above thread.
> 
> I'm a bit bogged with the coming merge window right now, but will have
> more to say in a bit
> 

Thanks! Take your time :)

>> The bigger TODO, though is the equivalent UAPI for IOMMUFD; I started with
>> the vfio-compat one as it was easier provided there's existing userspace to work
>> with (Qemu). To be fair the API is not that "far" from what would be IOMMUFD onto
>> steering the dirty tracking, read-clear the dirty bits, unmap and
>> get dirty. 
> 
> I think this is fine to start experimenting with
> 
>> But as we discussed earlier, the one gap of VFIO dirty API is that
>> it lacks controls for upgrading/downgrading area/IOPTE sizes which
>> is where IOMMUFD would most likely shine. When that latter part is
>> done we can probably adopt an eager-split approach inside
>> vfio-compat.
> 
> I think the native API should be new ioctls that operate on the
> hw_pagetable object to:
>   - enable/disable dirty tracking 
>   - read&clear a bitmap from a range
>   - read&unmap a bitmap from a range
>   - Manipulate IOPTE sizes
> 
Yeap -- heh nice, it is roughly what I was into already. I will take the
opportunity of doing the zerocopy/gup of the bitmap when writing the
iommufd native ioctls. The vfio-compat is still copying.

other things on my mind for iommufd native interface are:

1) The iopte split *could* be while we read the
dirty bits of a pgsize != than the args pgsize. The
read_and_clear_dirty() is already expensive, but I wonder about the idea
to take advantage of the dirty-clearing needed IOTLB flush, to also also
update the IOPTEs on the next memory-request/translation-request. Albeit
iopte collpase still needs to be routed via a specific ioctl() to promote
to a higher page size (should the migration fail or something).

(I referred to this in the past as lazy-split contrast to explicit IOPTE
size manipulation iommufd ioctls i.e. {manual,eager}-split)

2) I was thinking io_pagetable new APIs instead of hw_pagetable, but if you're
sort of seeing this as a hw_pagetable object primitive, then I am
wondering why? Albeit at the end of the day we operate on the iopt inside the hw_pagetable.

Note: IIUC that hw_pagetable is supposed to model a iommu_domain
direct manipulation which dirty fits in, but wouldn't that be applicable
to _MAP and _UNMAP too maybe?

> As you say it should not be much distance from the VFIO compat stuff
> 
> Most probably I would say to leave dirty tracking out of the type1 api
> and compat for it. Maybe we can make some limited cases work back
> compat, like the whole ioas supports iommu dirty tracking or
> something..
> 
> Need to understand if it is wortwhile - remember to use any of this
> you need a qemu that is updated to the v2 migration interface,
> so there is little practical value in back compat to old qemu if we
> expect qemu will use the native interface anyhow.
> 

Perhaps the value is really just *how much* applications need to get rewritten
to adjust to iommufd versus iteratively adding small (new) parts of it .
migration-v2 didn't turn into a complete rewrite of the vfio initial part iiuc.
Albeit I suspect keeping the compat was perhaps a struggle to keep, which we
might not what we desire if the semantics diverge too much.

e.g. If the semantics are about the same, for example, new VMMs could use the
iommufd new features using IOPTE size change ioctls() via the VFIO_IOAS (if available)
while leaving vfio-compat somewhat usable on the existent dirty ioctls. While slowly
moving to newer iommufd (and less vfio-iommu-type1) until the point you totally deprecate
it. Even without IOPTE size modification in vfio-compat you would still mark as dirty
a lot less than the current logic would (which really represents the whole bitmap).

I guess at the end of the day it is the maintenance cost -- Fortunately
APIs are looking similar and most of vfio-compat is dealing with userspace
args/validation.

Anyhow just some thoughts, open to anything really :)

>> Additionally I also sort of want to skeleton ARM and Intel to see how it looks.
>> Some of the commits made notes of some of research I made, so *I think* the APIs
>> introduced capture h/w semantics for all the three IOMMUs supporting dirty
>> tracking.
> 
> I think the primitives are pretty basic, I'd be surprised if there is
> something different :)
> 
Yeah, they are very basic.

My main unknown in the rework was how the hw protection domain context can deal with
dynamically switching dirty on and off. But I think I answered myself that restriction at
least from groking into specs and experimentation.

> Though things to be thinking about are how does this work with nested
> translation and other advanced features
/me nods

There's usually some flags/bits on the 'stage 2' domain, at least ARM (S2CD) and Intel
(second-level table). But for AMD, you usually control via the vIOMMU what features the hw
exposes[*], and I think there's an extra bit to set to disable vIOMMU dirty tracking from
VMM. But I think still the only thing needed is first-stage tracking, as that's what
matters for L0 live migration.

[*] AMD vIOMMU also offloads some of the command processing, not just the pagetables. Not
sure if ARM/Intel has the equivalent.
Joao Martins March 16, 2022, 8:37 p.m. UTC | #33
On 3/16/22 16:36, Joao Martins wrote:
> On 3/15/22 19:29, Jason Gunthorpe wrote:
>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>> On 2/28/22 13:01, Joao Martins wrote:
>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>> Questions I have:
>>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>>    ranges? how/what/why?
>>>>>

An amend here.

Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing
the ARM side. Apparently you have two controls: one is a 'feature bit'
just like x86 and another is a modifier (arm-only).

The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor
equivalent for second-level. That's the top-level enabler to actually a *second*
modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so
called DBM (dirty-bit-modifier). The latter when set, changes the meaning of
read/write access-flags of the PTE AP[2].

If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
transition in the SMMU from "writable Clean" to "written" means that the the
access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault
like it normally would be with DBM = 0). Same thing for stage-2, except that
the access-bits are reversed (S2AP[1] is set when "written" and it's cleared
when it's "writable" (when DBM is also set).

Now you could say that this allows you to control on a per-range basis.
Gah, no, more like a per-PTE basis is more accurate.

And in practice I suppose that means that dynamically switching on/off SMMU
dirty-tracking *dynamically* means not only setting CD.HD but also walking the
page tables, and atomically setting/clearing both the DBM and AP[2].

References:

DDI0487H, Table D5-30 Data access permissions
SMMU 3.2 spec, 3.13.3 Dirty flag hardware update
Joao Martins March 18, 2022, 5:12 p.m. UTC | #34
On 3/16/22 20:37, Joao Martins wrote:
> On 3/16/22 16:36, Joao Martins wrote:
>> On 3/15/22 19:29, Jason Gunthorpe wrote:
>>> On Fri, Mar 11, 2022 at 01:51:32PM +0000, Joao Martins wrote:
>>>> On 2/28/22 13:01, Joao Martins wrote:
>>>>> On 2/25/22 20:44, Jason Gunthorpe wrote:
>>>>>> On Fri, Feb 25, 2022 at 07:18:37PM +0000, Joao Martins wrote:
>>>>>>> On 2/23/22 01:03, Jason Gunthorpe wrote:
>>>>>>>> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>>>> Questions I have:
>>>>>>  - Do we need ranges for some reason? You mentioned ARM SMMU wants
>>>>>>    ranges? how/what/why?
>>>>>>
> 
> An amend here.
> 
> Sigh, ARM turns out is slightly more unique compared to x86. As I am re-reviewing
> the ARM side. Apparently you have two controls: one is a 'feature bit'
> just like x86 and another is a modifier (arm-only).
> 
> The Context descriptor (CD) equivalent to AMD DTEs or Intel context descriptor
> equivalent for second-level. That's the top-level enabler to actually a *second*
> modifier bit per-PTE (or per-TTD for more accurate terminology) which is the so
> called DBM (dirty-bit-modifier). The latter when set, changes the meaning of
> read/write access-flags of the PTE AP[2].
> 
> If you have CD.HD enabled (aka HTTU is enabled) *and* PTE.DBM set, then a
> transition in the SMMU from "writable Clean" to "written" means that the the
> access bits go from "read-only" (AP[2] = 1) to "read/write" (AP[2] = 0)
> if-and-only-if PTE.DBM = 1 (and does not generate a permission IO page fault
> like it normally would be with DBM = 0). Same thing for stage-2, except that
> the access-bits are reversed (S2AP[1] is set when "written" and it's cleared
> when it's "writable" (when DBM is also set).
> 
> Now you could say that this allows you to control on a per-range basis.
> Gah, no, more like a per-PTE basis is more accurate.
> 
> And in practice I suppose that means that dynamically switching on/off SMMU
> dirty-tracking *dynamically* means not only setting CD.HD but also walking the
> page tables, and atomically setting/clearing both the DBM and AP[2].
> 
> References:
> 
> DDI0487H, Table D5-30 Data access permissions
> SMMU 3.2 spec, 3.13.3 Dirty flag hardware update

I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based
on the past work) and adjusted the 'set tracking' structure to cover this slightly
different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API,
which is internal in iommufd obviously. The UAPI won't change ofc.

It's only compile-tested sadly as I have no SMMUv3.2 hardware, and to have this SMMUv3
DBM/HTTU support there's some requirements on the processor that I am not sure they can be
fully emulated. The Intel iommu implementation follows same model as AMD and I will get to
that next with a compliant iommu emulation too.

But now, I will be focusing on hw_pagetable UAPI part. I understand your thinking of being
tied to the hw_pagetable obj as opposed to IOAS for the dirty tracking APIs.
Jason Gunthorpe March 18, 2022, 5:34 p.m. UTC | #35
On Fri, Mar 18, 2022 at 05:12:01PM +0000, Joao Martins wrote:
> I updated my branch and added an SMMUv3 implementation of the whole thing (slightly based
> on the past work) and adjusted the 'set tracking' structure to cover this slightly
> different h/w construct above. At the high-level we have 'set_dirty_tracking_range' API,
> which is internal in iommufd obviously. The UAPI won't change ofc.

Shameerali has HW, AFAIK..

Jason