Message ID: 20231024135109.73787-1-joao.m.martins@oracle.com
Series:     IOMMUFD Dirty Tracking
On Tue, Oct 24, 2023 at 02:50:51PM +0100, Joao Martins wrote:
> v6 is a replacement of what's in iommufd next:
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
>
> Joao Martins (18):
>   vfio/iova_bitmap: Export more API symbols
>   vfio: Move iova_bitmap into iommufd
>   iommufd/iova_bitmap: Move symbols to IOMMUFD namespace
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Add a flag to enforce dirty tracking on attach
>   iommufd: Add IOMMU_HWPT_SET_DIRTY_TRACKING
>   iommufd: Add IOMMU_HWPT_GET_DIRTY_BITMAP
>   iommufd: Add capabilities to IOMMU_GET_HW_INFO
>   iommufd: Add a flag to skip clearing of IOPTE dirty
>   iommu/amd: Add domain_alloc_user based domain allocation
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/vt-d: Access/Dirty bit support for SS domains
>   iommufd/selftest: Expand mock_domain with dev_flags
>   iommufd/selftest: Test IOMMU_HWPT_ALLOC_DIRTY_TRACKING
>   iommufd/selftest: Test IOMMU_HWPT_SET_DIRTY_TRACKING
>   iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP
>   iommufd/selftest: Test out_capabilities in IOMMU_GET_HW_INFO
>   iommufd/selftest: Test IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR flag

Ok, I refreshed the series, thanks!

Jason
On Tue, Oct 24, 2023 at 12:55:12PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 24, 2023 at 02:50:51PM +0100, Joao Martins wrote:
> > v6 is a replacement of what's in iommufd next:
> > https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
> >
[...]
>
> Ok, I refreshed the series, thanks!

Selftest is passing with this version.

Cheers
Nicolin
Hi, Joao

On Tue, 24 Oct 2023 at 21:51, Joao Martins <joao.m.martins@oracle.com> wrote:
>
> v6 is a replacement of what's in iommufd next:
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
>
> base-commit: b5f9e63278d6f32789478acf1ed41d21d92b36cf
> (from the iommufd tree)
>
> =========>8=========
>
> Presented herewith is a series that extends IOMMUFD with IOMMU hardware
> support for the dirty bit in the IOPTEs.
>
> Today, AMD Milan (or more recent) supports it, while ARM SMMUv3.2 and
> VT-d rev3.x also support it. One intended (but not exclusive) use case
> is to support live migration with SR-IOV, especially useful for
> live-migratable PCI devices that cannot supply their own dirty-tracking
> hardware blocks, among others.
>
> At a quick glance, IOMMUFD lets userspace create an IOAS with a set of
> IOVA ranges mapped to some physical memory, composing an IO pagetable.
> This is then created via HWPT_ALLOC, or attached to a particular
> device/hwpt, consequently creating the IOMMU domain and sharing a
> common IO page table representing the endpoint's DMA-addressable guest
> address space. IOMMUFD dirty tracking (since v2 of the series) is
> supported via the HWPT_ALLOC model only, as opposed to the simpler
> autodomains model.
>
> The result is an hw_pagetable which represents the iommu_domain that
> will be directly manipulated. The IOMMUFD UAPI and the iommu/iommufd
> kAPI are then extended to provide:
>
> 1) Enforcement that only devices with dirty tracking support are
> attached to an IOMMU domain, to cover the case where support isn't
> homogeneous across the platform. Initially this is aimed at the
> possibly heterogeneous nature of ARM, while x86 gets future-proofed,
> should any such occasion occur.
>
> The device dirty tracking enforcement on attach_dev is made whether or
> not the dirty_ops are set. Given that attach always checks for dirty
> ops and IOMMU_CAP_DIRTY, while writing this I was tempted to move the
> check to the upper layer, but semantically the iommu driver should do
> the checking.
>
> 2) Toggling of dirty tracking on the iommu_domain. We model the most
> common case of changing hardware translation control structures
> dynamically (x86), while making it easy to have an always-enabled mode.
> In RFCv1, the ARM-specific case was suggested to be always enabled
> instead of having to enable the per-PTE DBM control bit (what I
> previously called "range tracking"). Here, setting/clearing tracking
> just clears the dirty bits at start. The 'real' tracking of whether
> dirty tracking is enabled is stored in the IOMMU driver, hence no new
> fields are added to iommufd pagetable structures, except for a
> dirty_ops field added to iommu_domain. IOMMUFD also uses that to know
> whether dirty tracking is supported and toggleable, without having
> iommu drivers replicate said checks.
>
> 3) Capability probing for dirty tracking, leveraging the per-device
> iommu_capable() and adding IOMMU_CAP_DIRTY. It extends the GET_HW_INFO
> ioctl, which takes a device ID, to additionally return some generic
> capabilities. Possible values are enumerated by `enum
> iommufd_hw_capabilities`.
>
> 4) Reading the I/O PTEs and marshalling their dirtiness into a bitmap.
> The bitmap indexes, on a page_size basis, the IOVAs that got written by
> the device. While performing the marshalling, drivers also need to
> clear the dirty bits from the IOPTEs and allow the kAPI caller to batch
> the much-needed IOTLB flush.
> There's no copy of bitmaps to userspace-backed memory; everything is
> zero-copy based, to not add more cost to the iommu driver IOPT walker.
> This shares functionality with VFIO device dirty tracking via the IOVA
> bitmap APIs. So far this is a test-and-clear kind of interface, given
> that the IOPT walk is going to be expensive. In addition, this also
> adds the ability to read dirty bit info without clearing the PTE info.
> This is meant to cover the unmap-and-read-dirty use case and avoid the
> second IOTLB flush.
>
> The only dependency is:
> * Have the domain_alloc_user() API with flags [2] already queued
>   (iommufd/for-next).
>
> The series is organized as follows:
>
> * Patches 1-4: Take care of the iommu domain operations to be added.
>   The idea is to abstract iommu drivers from any idea of how bitmaps
>   are stored or propagated back to the caller, as well as allowing
>   control/batching over IOTLB flushes. So there's a data structure and
>   a helper that only tell the upper layer that an IOVA range got dirty.
>   This logic is shared with VFIO, and it's meant to walk the bitmap
>   user memory, kmap-ing and setting bits as needed. The IOMMU driver
>   just has an idea of a 'dirty bitmap state' and of recording an IOVA
>   as dirty.
>
> * Patches 5-9, 13-18: Add the UAPIs for IOMMUFD, and selftests. The
>   selftests cover some corner cases on boundary handling of the bitmap
>   and exercise various bitmap sizes. I haven't included huge IOVA
>   ranges to avoid risking the selftests failing to execute due to OOM
>   issues when mmaping big buffers.
>
> * Patches 10-11: AMD IOMMU implementation, particularly for those with
>   HDSup support. Tested with a QEMU amd-iommu with HDSup emulated [0],
>   and tested with live migration with VFs (but with IOMMU dirty
>   tracking).
>
> * Patch 12: Intel IOMMU rev3.x+ implementation. Tested with a
>   QEMU-based intel-iommu vIOMMU with SSADS emulation support [0].
>
> On AMD/Intel I have tested this with emulation and then live migration
> on AMD hardware.
>
> The QEMU iommu emulation bits are there to increase coverage of this
> code and hopefully make this more broadly available to fellow
> contributors/devs (old version [1]); it uses Yi's 2 commits to have
> hw_info() supported (still needs a bit of cleanup) on top of a recent
> Zhenzhong series of IOMMUFD QEMU bringup work: see here [0]. It
> includes IOMMUFD dirty tracking for live migration, and live migration
> was tested. I won't be following up with a v2 of the QEMU patches until
> IOMMUFD tracking lands.
>
> Feedback or any comments are very much appreciated.
>
> Thanks!
>         Joao

Is this patchset enough for iommufd live migration?

Just tried live migration on a local machine; it reports "VFIO migration
is not supported in kernel".

Thanks
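[For reference, a rough userspace sketch of the flow the cover letter
above describes, assuming the structs and flags as they appear in
include/uapi/linux/iommufd.h once this series is applied (GET_HW_INFO
out_capabilities, the HWPT_ALLOC dirty-tracking flag,
SET_DIRTY_TRACKING and GET_DIRTY_BITMAP). The helper name, its
parameters, and the assumption that dev_id/ioas_id were already obtained
via VFIO_DEVICE_BIND_IOMMUFD / IOMMU_IOAS_ALLOC are illustrative only;
error handling is trimmed.]

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int hwpt_dirty_flow(int iommufd, uint32_t dev_id, uint32_t ioas_id,
			   uint64_t iova, uint64_t length, uint64_t page_size)
{
	/* 1) Probe the generic capability via IOMMU_GET_HW_INFO. */
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};
	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -errno;
	if (!(info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING))
		return -EOPNOTSUPP;

	/* 2) Allocate the HWPT with dirty tracking enforced at alloc. */
	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING,
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
		return -errno;

	/* 3) Start tracking (clears any stale dirty bits at start). */
	struct iommu_hwpt_set_dirty_tracking set = {
		.size = sizeof(set),
		.flags = IOMMU_HWPT_DIRTY_TRACKING_ENABLE,
		.hwpt_id = alloc.out_hwpt_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set))
		return -errno;

	/* 4) Read-and-clear the IOPTE dirty bits into a user bitmap:
	 * one bit per page_size unit of IOVA. */
	uint64_t bits = length / page_size;
	uint64_t *bitmap = calloc((bits + 63) / 64, sizeof(*bitmap));
	struct iommu_hwpt_get_dirty_bitmap get = {
		.size = sizeof(get),
		.hwpt_id = alloc.out_hwpt_id,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = (uintptr_t)bitmap,
		/* .flags = IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR for the
		 * unmap-and-read-dirty case mentioned above. */
	};
	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get)) {
		free(bitmap);
		return -errno;
	}
	free(bitmap);
	return 0;
}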
On 2024/10/29 10:35, Zhangfei Gao wrote:
> VFIO migration is not supported in kernel
do you have a vfio-pci-xxx driver that suits your device? Looks
like your case failed when checking the VFIO_DEVICE_FEATURE_GET |
VFIO_DEVICE_FEATURE_MIGRATION via VFIO_DEVICE_FEATURE.
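[The check Yi refers to can be reproduced from userspace directly
against the VFIO device fd. A minimal sketch using the standard VFIO
uAPI from <linux/vfio.h>; the helper name is illustrative:]

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int probe_vfio_migration(int device_fd, uint64_t *mig_flags)
{
	/* vfio_device_feature ends in a flexible array, so build the GET
	 * request in a flat buffer and overlay the two structs. */
	uint64_t buf[(sizeof(struct vfio_device_feature) +
		      sizeof(struct vfio_device_feature_migration) + 7) / 8];
	struct vfio_device_feature *feature = (void *)buf;
	struct vfio_device_feature_migration *mig = (void *)feature->data;

	memset(buf, 0, sizeof(buf));
	feature->argsz = sizeof(buf);
	feature->flags = VFIO_DEVICE_FEATURE_GET |
			 VFIO_DEVICE_FEATURE_MIGRATION;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -errno;	/* ENOTTY when the bound driver has no mig_ops */

	*mig_flags = mig->flags;	/* e.g. VFIO_MIGRATION_STOP_COPY */
	return 0;
}

[Plain vfio-pci has no mig_ops, so the ioctl fails with ENOTTY, which
QEMU then reports as "VFIO migration is not supported in kernel".]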
On Tue, 29 Oct 2024 at 10:48, Yi Liu <yi.l.liu@intel.com> wrote:
>
> On 2024/10/29 10:35, Zhangfei Gao wrote:
> > VFIO migration is not supported in kernel
>
> do you have a vfio-pci-xxx driver that suits your device? Looks
> like your case failed when checking the VFIO_DEVICE_FEATURE_GET |
> VFIO_DEVICE_FEATURE_MIGRATION via VFIO_DEVICE_FEATURE.

Thanks Yi for the guidance.

Yes, the ioctl VFIO_DEVICE_FEATURE with VFIO_DEVICE_FEATURE_MIGRATION
fails, since:

        if (!device->mig_ops)
                return -ENOTTY;

drivers/vfio/pci/vfio_pci.c is currently used, and it has no mig_ops.
Looks like I have to use another driver such as
drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c; still checking.

Thanks Yi.
On 2024/10/29 16:05, Zhangfei Gao wrote:
> On Tue, 29 Oct 2024 at 10:48, Yi Liu <yi.l.liu@intel.com> wrote:
>>
>> On 2024/10/29 10:35, Zhangfei Gao wrote:
>>> VFIO migration is not supported in kernel
>>
>> do you have a vfio-pci-xxx driver that suits your device? Looks
>> like your case failed when checking the VFIO_DEVICE_FEATURE_GET |
>> VFIO_DEVICE_FEATURE_MIGRATION via VFIO_DEVICE_FEATURE.
>
> Thanks Yi for the guidance.
>
> Yes, the ioctl VFIO_DEVICE_FEATURE with VFIO_DEVICE_FEATURE_MIGRATION
> fails, since:
>
>         if (!device->mig_ops)
>                 return -ENOTTY;
>
> drivers/vfio/pci/vfio_pci.c is currently used, and it has no mig_ops.
> Looks like I have to use another driver such as
> drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c; still checking.

Indeed. You need to bind your device to a variant driver named
vfio-pci-xxx, which will provide the mig_ops and even the dirty logging
feature. Good luck.
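[For completeness, a schematic fragment only (not a complete driver) of
what such a variant driver provides so the VFIO_DEVICE_FEATURE_MIGRATION
probe above succeeds, patterned on existing variant drivers such as
hisi_acc_vfio_pci and mlx5; the my_vf_* names are placeholders and the
actual device state handling is omitted.]

#include <linux/err.h>
#include <linux/sizes.h>
#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>

static struct file *
my_vf_set_state(struct vfio_device *vdev, enum vfio_device_mig_state new_state)
{
	/* Walk the migration FSM; return a data fd for STOP_COPY states. */
	return ERR_PTR(-EOPNOTSUPP);	/* placeholder */
}

static int my_vf_get_state(struct vfio_device *vdev,
			   enum vfio_device_mig_state *curr_state)
{
	*curr_state = VFIO_DEVICE_STATE_RUNNING;	/* placeholder */
	return 0;
}

static int my_vf_get_data_size(struct vfio_device *vdev,
			       unsigned long *stop_copy_length)
{
	*stop_copy_length = SZ_1M;	/* device-state size, device specific */
	return 0;
}

static const struct vfio_migration_ops my_vf_mig_ops = {
	.migration_set_state = my_vf_set_state,
	.migration_get_state = my_vf_get_state,
	.migration_get_data_size = my_vf_get_data_size,
};

/* Wired up in the variant driver's vfio_device_ops .init hook: */
static int my_vf_init_dev(struct vfio_device *vdev)
{
	vdev->migration_flags = VFIO_MIGRATION_STOP_COPY;
	vdev->mig_ops = &my_vf_mig_ops;
	return vfio_pci_core_init_dev(vdev);
}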
On Tue, 24 Oct 2023 at 21:51, Joao Martins <joao.m.martins@oracle.com> wrote:
>
> v6 is a replacement of what's in iommufd next:
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
>
[...]
>
> Feedback or any comments are very much appreciated.
>
> Thanks!
>         Joao

Hi, Joao and Yi

I just tried this on aarch64: live migration with "iommu=nested-smmuv3"
does not work; vbasedev->dirty_pages_supported=0.

qemu-system-aarch64: -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0,enable-migration=on,x-pre-copy-dirty-page-tracking=off:
warning: 0000:75:00.1: VFIO device doesn't support device and IOMMU dirty tracking
qemu-system-aarch64: -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0,enable-migration=on,x-pre-copy-dirty-page-tracking=off:
vfio 0000:75:00.1: 0000:75:00.1: Migration is currently not supported with vIOMMU enabled

hw/vfio/migration.c:
    if (vfio_viommu_preset(vbasedev)) {
        error_setg(&err, "%s: Migration is currently not supported "
                   "with vIOMMU enabled", vbasedev->name);
        goto add_blocker;
    }

Does this mean live migration with vIOMMU is still not ready? It is not
an error; it is how migration is being blocked until all other related
feature support is added for vIOMMU. And is more work still needed to
enable migration with vIOMMU?

By the way, live migration works if "iommu=nested-smmuv3" is removed.

Any suggestions?

Thanks
On Wed, Oct 30, 2024 at 11:15:02PM +0800, Zhangfei Gao wrote:
> hw/vfio/migration.c
>     if (vfio_viommu_preset(vbasedev)) {
>         error_setg(&err, "%s: Migration is currently not supported "
>                    "with vIOMMU enabled", vbasedev->name);
>         goto add_blocker;
>     }

The viommu driver itself does not support live migration; it would need
to preserve all the guest configuration and bring it all back. It
doesn't know how to do that yet.

Jason
On 30/10/2024 15:36, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2024 at 11:15:02PM +0800, Zhangfei Gao wrote:
>> hw/vfio/migration.c
>>     if (vfio_viommu_preset(vbasedev)) {
>>         error_setg(&err, "%s: Migration is currently not supported "
>>                    "with vIOMMU enabled", vbasedev->name);
>>         goto add_blocker;
>>     }
>
> The viommu driver itself does not support live migration; it would need
> to preserve all the guest configuration and bring it all back. It
> doesn't know how to do that yet.

It's more of a vfio-code limitation, not quite related to the actual hw
vIOMMU.

There are some vfio migration + vIOMMU support patches I have to follow
up on (v5), but unexpected setbacks unrelated to work delayed some of my
plans for QEMU 9.2. I expect to resume in a few weeks. I can point you
to a branch while I don't submit (given that soft-freeze is coming).

Joao
Hi Joao,

> -----Original Message-----
> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Wednesday, October 30, 2024 3:47 PM
> Subject: Re: [PATCH v6 00/18] IOMMUFD Dirty Tracking
>
> On 30/10/2024 15:36, Jason Gunthorpe wrote:
> > On Wed, Oct 30, 2024 at 11:15:02PM +0800, Zhangfei Gao wrote:
> >> hw/vfio/migration.c
> >>     if (vfio_viommu_preset(vbasedev)) {
> >>         error_setg(&err, "%s: Migration is currently not supported "
> >>                    "with vIOMMU enabled", vbasedev->name);
> >>         goto add_blocker;
> >>     }
> >
> > The viommu driver itself does not support live migration; it would
> > need to preserve all the guest configuration and bring it all back.
> > It doesn't know how to do that yet.
>
> It's more of a vfio-code limitation, not quite related to the actual hw
> vIOMMU.
>
> There are some vfio migration + vIOMMU support patches I have to follow
> up on (v5)

Are you referring to this series here?
https://lore.kernel.org/qemu-devel/d5d30f58-31f0-1103-6956-377de34a790c@redhat.com/T/

Is that enabling migration only if the Guest doesn't do any DMA
translations?

> but unexpected setbacks unrelated to work delayed some of my plans for
> QEMU 9.2. I expect to resume in a few weeks. I can point you to a
> branch while I don't submit (given that soft-freeze is coming).

Also, I think we need a mechanism for page fault handling in case the
Guest handles stage 1, plus dirty tracking for stage 1 as well.

Thanks,
Shameer
On 30/10/2024 15:57, Shameerali Kolothum Thodi wrote:
>> On 30/10/2024 15:36, Jason Gunthorpe wrote:
>>> The viommu driver itself does not support live migration; it would
>>> need to preserve all the guest configuration and bring it all back.
>>> It doesn't know how to do that yet.
>>
>> It's more of a vfio-code limitation, not quite related to the actual
>> hw vIOMMU.
>>
>> There are some vfio migration + vIOMMU support patches I have to
>> follow up on (v5)
>
> Are you referring to this series here?
> https://lore.kernel.org/qemu-devel/d5d30f58-31f0-1103-6956-377de34a790c@redhat.com/T/
>
> Is that enabling migration only if the Guest doesn't do any DMA
> translations?

No, it does it when the guest is using the sw vIOMMU too. To be clear:
this has nothing to do with nested IOMMU, or with the guest doing
(emulated) dirty tracking.

When the guest doesn't do DMA translation, it is this patch:

https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/

>> but unexpected setbacks unrelated to work delayed some of my plans for
>> QEMU 9.2. I expect to resume in a few weeks. I can point you to a
>> branch while I don't submit (given that soft-freeze is coming).
>
> Also, I think we need a mechanism for page fault handling in case the
> Guest handles stage 1, plus dirty tracking for stage 1 as well.

I have emulation for x86 iommus to do dirty tracking, but that is
unrelated to L0 live migration -- it's more for testing in the lack of
recent hardware. Even emulated page fault handling doesn't affect this
unless you have to re-map/map a new IOVA, which would also be covered in
this series, I think.

Unless you are talking about physical IOPF that QEMU may terminate,
though we don't have such support in QEMU atm.
> -----Original Message-----
> From: Joao Martins <joao.m.martins@oracle.com>
> Sent: Wednesday, October 30, 2024 4:57 PM
> Subject: Re: [PATCH v6 00/18] IOMMUFD Dirty Tracking
>
> On 30/10/2024 15:57, Shameerali Kolothum Thodi wrote:
> > Are you referring to this series here?
> > https://lore.kernel.org/qemu-devel/d5d30f58-31f0-1103-6956-377de34a790c@redhat.com/T/
> >
> > Is that enabling migration only if the Guest doesn't do any DMA
> > translations?
>
> No, it does it when the guest is using the sw vIOMMU too. To be clear:
> this has nothing to do with nested IOMMU, or with the guest doing
> (emulated) dirty tracking.

Ok. Thanks for explaining. So just to clarify, this works for Intel vt-d
with "caching-mode=on", i.e. no real 2-stage setup is required like in
ARM SMMUv3.

> When the guest doesn't do DMA translation, it is this patch:
>
> https://lore.kernel.org/qemu-devel/20230908120521.50903-1-joao.m.martins@oracle.com/

Ok.

> > Also, I think we need a mechanism for page fault handling in case the
> > Guest handles stage 1, plus dirty tracking for stage 1 as well.
>
> I have emulation for x86 iommus to do dirty tracking, but that is
> unrelated to L0 live migration -- it's more for testing in the lack of
> recent hardware. Even emulated page fault handling doesn't affect this
> unless you have to re-map/map a new IOVA, which would also be covered
> in this series I think.
>
> Unless you are talking about physical IOPF that QEMU may terminate,
> though we don't have such support in QEMU atm.

Yeah, I was referring to the ARM SMMUv3 cases, where we need
nested-smmuv3 support for vfio-pci assignment. Another use case we have
is supporting SVA in the Guest, with hardware capable of physical IOPF.

I will take a look at your series above and see what else is required
to support ARM. Please CC me if you plan to respin or have a latest
branch. Thanks for your efforts.

Shameer
On 30/10/2024 18:41, Shameerali Kolothum Thodi wrote:
>> Unless you are talking about physical IOPF that QEMU may terminate,
>> though we don't have such support in QEMU atm.
>
> Yeah, I was referring to the ARM SMMUv3 cases, where we need
> nested-smmuv3 support for vfio-pci assignment. Another use case we have
> is supporting SVA in the Guest, with hardware capable of physical IOPF.
>
> I will take a look at your series above and see what else is required
> to support ARM. Please CC me if you plan to respin or have a latest
> branch. Thanks for your efforts.

Right, the series above works for emulated intel-iommu, and I had some
patches for virtio-iommu (which got simpler thanks to Eric's work on
aw-bits). amd-iommu is easily added too (but it needs to work with guest
non-passthrough mode first before we get there). The only thing you need
to do on the iommu emulation side is to expose IOMMU_ATTR_MAX_IOVA [0],
and that should be all.

For guest hw-nesting / IOPF I don't think I see it affected, considering
the GPA space remains the same and we still have a parent pagetable to
get the stage-2 A/D bits (like the non-nested case).

[0] https://lore.kernel.org/qemu-devel/20230622214845.3980-11-joao.m.martins@oracle.com/

Joao
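[For reference, a rough sketch of what exposing that attribute looks
like on the emulated-IOMMU side, assuming the IOMMU_ATTR_MAX_IOVA
attribute from the proposal in [0] (not in upstream QEMU at the time of
writing) plugged into the existing IOMMUMemoryRegionClass get_attr
hook; the my_viommu_* names are placeholders.]

#include "qemu/osdep.h"
#include "exec/memory.h"

static int my_viommu_get_attr(IOMMUMemoryRegion *iommu_mr,
                              enum IOMMUMemoryRegionAttr attr, void *data)
{
    if (attr == IOMMU_ATTR_MAX_IOVA) {
        /* Placeholder: derive from the emulated IOMMU's address width
         * (e.g. the aw-bits property), here assumed to be 48 bits. */
        *(hwaddr *)data = (1ULL << 48) - 1;
        return 0;
    }
    return -EINVAL;
}

static void my_viommu_memory_region_class_init(ObjectClass *klass, void *data)
{
    IOMMUMemoryRegionClass *imrc = IOMMU_MEMORY_REGION_CLASS(klass);

    imrc->get_attr = my_viommu_get_attr;
}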