[v8,0/3] mm/gup: disallow GUP writing to file-backed mappings by default

Message ID cover.1683067198.git.lstoakes@gmail.com (mailing list archive)

Lorenzo Stoakes May 2, 2023, 10:51 p.m. UTC
Writing to file-backed mappings which require folio dirty tracking using
GUP is a fundamentally broken operation, as kernel write access to GUP
mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not
cause write notify to trigger, nor does it enforce that the caller marks
the folio dirty.

The problem arises when, after an initial write to the folio, writeback
results in the folio being cleaned and then the caller, via the GUP
interface, writes to the folio again.

Because the write goes through this secondary, direct mapping of the folio,
no write notify occurs, and if the caller does mark the folio dirty, it
does so at a point the file system does not expect.

For example, consider the following scenario:-

1. A folio is written to via GUP, which write-faults the memory, notifying
   the file system and dirtying the folio.
2. Later, writeback is triggered, resulting in the folio being cleaned and
   the PTE being marked read-only.
3. The GUP caller writes to the folio, as it is mapped read/write via the
   direct mapping.
4. The GUP caller, now done with the page, unpins it and sets it dirty
   (though it does not have to).
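
The four steps above can be followed in a toy userspace model. This is only
an illustrative sketch, not kernel code: struct toy_folio and every toy_*
function are hypothetical names invented here, and the model tracks nothing
beyond the dirty flag, the PTE write bit and whether the file system was
notified.

```c
#include <stdbool.h>

/* Toy model of one file-backed folio. All names here are hypothetical. */
struct toy_folio {
	bool dirty;            /* dirty flag as the file system sees it */
	bool pte_writable;     /* write bit of the userspace PTE */
	int fs_notified;       /* write-notify events delivered to the fs */
	int unnotified_writes; /* writes made while the fs believed it clean */
};

/* Step 1: a write fault on a read-only PTE notifies the fs and dirties. */
static void toy_write_fault(struct toy_folio *f)
{
	if (!f->pte_writable) {
		f->fs_notified++;
		f->pte_writable = true;
	}
	f->dirty = true;
}

/* Step 2: writeback cleans the folio and write-protects the PTE. */
static void toy_writeback(struct toy_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/*
 * Step 3: the pin holder writes through the direct mapping. The PTE is
 * not consulted, so no fault and no write notify can happen here.
 */
static void toy_direct_write(struct toy_folio *f)
{
	if (!f->dirty)
		f->unnotified_writes++;
}

/* Run steps 1-4; step 4 optionally marks the folio dirty on unpin. */
static struct toy_folio toy_run_scenario(bool mark_dirty_on_unpin)
{
	struct toy_folio f = { 0 };

	toy_write_fault(&f);
	toy_writeback(&f);
	toy_direct_write(&f);
	if (mark_dirty_on_unpin)
		f.dirty = true;
	return f;
}
```

In both variants of step 4 the model ends with one write the file system
was never told about. If the folio is marked dirty on unpin, it becomes
dirty while its PTE is still read-only; if not, the data was modified
behind a folio that remains clean.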

This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As
pin_user_pages_fast_only() does not exist, we can rely on a slightly
imperfect whitelisting in the PUP-fast case and fall back to the slow case
should this fail.

v8:
- Fixed typo writeable -> writable.
- Fixed bug in writable_file_mapping_allowed() - must check combination of
  FOLL_PIN AND FOLL_LONGTERM not either/or.
- Updated vma_needs_dirty_tracking() to include write/shared to account for
  MAP_PRIVATE mappings.
- Move to open-coding the checks in folio_pin_allowed() so we can
  READ_ONCE() the mapping and avoid unexpected compiler loads. Rename to
  account for the fact that we now check flags here.
- Disallow mapping == NULL or mapping & PAGE_MAPPING_FLAGS other than
  anon. Defer to slow path.
- Perform GUP-fast check _after_ the lowest page table level is confirmed to
  be stable.
- Updated comments and commit message for final patch as per Jason's
  suggestions.
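
The flag and mapping checks described in the v8 notes can be sketched
roughly as below. This is a simplified userspace model, not the code from
the patches: the SKETCH_* flag values are placeholders rather than the
kernel's real FOLL_* values, and struct sketch_vma with its
fs_tracks_dirty field is a hypothetical stand-in for whatever the kernel
derives from the real VMA and its file system.

```c
#include <stdbool.h>

/* Placeholder flag bits; NOT the kernel's actual FOLL_* values. */
#define SKETCH_FOLL_WRITE    (1u << 0)
#define SKETCH_FOLL_PIN      (1u << 1)
#define SKETCH_FOLL_LONGTERM (1u << 2)

/* Hypothetical, heavily reduced view of a VMA. */
struct sketch_vma {
	bool shared;          /* MAP_SHARED rather than MAP_PRIVATE */
	bool writable;        /* mapped with write permission */
	bool fs_tracks_dirty; /* backing fs relies on dirty tracking */
};

/* Only shared, writable mappings of a dirty-tracking fs qualify. */
static bool sketch_vma_needs_dirty_tracking(const struct sketch_vma *vma)
{
	if (!(vma->shared && vma->writable))
		return false;
	return vma->fs_tracks_dirty;
}

/*
 * The restriction applies only when BOTH pin and longterm are requested;
 * testing for either flag on its own is the bug the v8 notes mention.
 */
static bool sketch_writable_file_mapping_allowed(const struct sketch_vma *vma,
						 unsigned int gup_flags)
{
	const unsigned int both = SKETCH_FOLL_PIN | SKETCH_FOLL_LONGTERM;

	if ((gup_flags & both) != both)
		return true;
	return !sketch_vma_needs_dirty_tracking(vma);
}

/* Assumed call site: the check matters only for write requests. */
static bool sketch_pin_allowed(const struct sketch_vma *vma,
			       unsigned int gup_flags)
{
	if (!(gup_flags & SKETCH_FOLL_WRITE))
		return true;
	return sketch_writable_file_mapping_allowed(vma, gup_flags);
}
```

Under these assumptions a long-term write pin is refused for a shared
writable mapping of a dirty-tracking file system, and still permitted for
a MAP_PRIVATE mapping of the same file or for a shared mapping whose file
system does not need dirty tracking.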

v7:
- Fixed very silly bug in writeable_file_mapping_allowed() inverting the
  logic.
- Removed unnecessary RCU lock code and replaced with adaptation of Peter's
  idea.
- Removed unnecessary open-coded folio_test_anon() in
  folio_longterm_write_pin_allowed() and restructured to generally permit
  NULL folio_mapping().
https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com/

v6:
- Rebased on latest mm-unstable as of 28th April 2023.
- Add PUP-fast check with handling for rcu-locked TLB shootdown to synchronise
  correctly.
- Split patch series into 3 to make it more digestible.
https://lore.kernel.org/all/cover.1682981880.git.lstoakes@gmail.com/

v5:
- Rebased on latest mm-unstable as of 25th April 2023.
- Some small refactorings suggested by John.
- Added an extended description of the problem in the comment around
  writeable_file_mapping_allowed() for clarity.
- Updated commit message as suggested by Mika and John.
https://lore.kernel.org/all/6b73e692c2929dc4613af711bdf92e2ec1956a66.1682638385.git.lstoakes@gmail.com/

v4:
- Split out vma_needs_dirty_tracking() from vma_wants_writenotify() to
  reduce duplication and update to use this in the GUP check. Note that
  both separately check vm_ops_needs_writenotify() as the latter needs to
  test this before the vm_pgprot_modify() test, resulting in
  vma_wants_writenotify() checking this twice, however it is such a small
  check this should not be egregious.
https://lore.kernel.org/all/3b92d56f55671a0389252379237703df6e86ea48.1682464032.git.lstoakes@gmail.com/

v3:
- Rebased on latest mm-unstable as of 24th April 2023.
- Explicitly check whether file system requires folio dirtying. Note that
  vma_wants_writenotify() could not be used directly as it is very much focused
  on determining if the PTE r/w should be set (e.g. assuming private mapping
  does not require it as already set, soft dirty considerations).
- Tested code against shmem and hugetlb mappings - confirmed that these are not
  disallowed by the check.
- Eliminate FOLL_ALLOW_BROKEN_FILE_MAPPING flag and instead perform check only
  for FOLL_LONGTERM pins.
- As a result, limit check to internal GUP code.
https://lore.kernel.org/all/23c19e27ef0745f6d3125976e047ee0da62569d4.1682406295.git.lstoakes@gmail.com/

v2:
- Add accidentally excluded ptrace_access_vm() use of
  FOLL_ALLOW_BROKEN_FILE_MAPPING.
- Tweak commit message.
https://lore.kernel.org/all/c8ee7e02d3d4f50bb3e40855c53bda39eec85b7d.1682321768.git.lstoakes@gmail.com/

v1:
https://lore.kernel.org/all/f86dc089b460c80805e321747b0898fd1efe93d7.1682168199.git.lstoakes@gmail.com/

Lorenzo Stoakes (3):
  mm/mmap: separate writenotify and dirty tracking logic
  mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed
    mappings
  mm/gup: disallow FOLL_LONGTERM GUP-fast writing to file-backed
    mappings

 include/linux/mm.h |   1 +
 mm/gup.c           | 146 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/mmap.c          |  53 ++++++++++++----
 3 files changed, 186 insertions(+), 14 deletions(-)

--
2.40.1

Comments

Matthew Rosato May 3, 2023, 12:31 a.m. UTC | #1
On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
> [...]

Tested again on s390 using QEMU with a memory backend file (on ext4) and
vfio-pci. This time both vfio_pin_pages_remote() (which calls
pin_user_pages_remote(flags | FOLL_LONGTERM)) and the
pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in
kvm_s390_pci_aif_enable() are allowed (i.e. they return a positive pin
count).
David Hildenbrand May 3, 2023, 7:08 a.m. UTC | #2
On 03.05.23 02:31, Matthew Rosato wrote:
> On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
>> [...]
> 
> Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- This time both vfio_pin_pages_remote (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable are being allowed (e.g. returning positive pin count)

At least it's consistent now ;) And it might be working as expected ...

In v7:
* pin_user_pages_fast() succeeded
* vfio_pin_pages_remote() failed

But also in v7:
* GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
   mappings
* Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings

In v8:
* pin_user_pages_fast() succeeds
* vfio_pin_pages_remote() succeeds

But also in v8:
* GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
   mappings
* Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings


I have to speculate, but ... could it be that you are using a private 
mapping?

In QEMU, unfortunately, the default for memory-backend-file is 
"share=off" (private) ... for memory-backend-memfd it is "share=on" 
(shared). The default is stupid ...
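
As background to the share=off / share=on distinction: the two settings
correspond to MAP_PRIVATE and MAP_SHARED mappings of the backing file. The
userspace sketch below only illustrates that difference and is unrelated
to QEMU's own code; the helper name write_reaches_file() and the /tmp path
template are assumptions made for the example.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map a fresh temporary file with map_flag (MAP_SHARED or MAP_PRIVATE),
 * write one byte through the mapping, then read the file back.
 * Returns 1 if the write reached the file, 0 if not, -1 on error.
 */
static int write_reaches_file(int map_flag)
{
	char path[] = "/tmp/gup-sketch-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	char byte = 0;
	char *map;
	int ret = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path); /* the file now lives on only through fd */

	if (page <= 0 || ftruncate(fd, page) != 0)
		goto out;

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, map_flag, fd, 0);
	if (map == MAP_FAILED)
		goto out;

	map[0] = 'x';
	/* Flush so that the read below reliably reflects a shared write. */
	msync(map, (size_t)page, MS_SYNC);

	if (pread(fd, &byte, 1, 0) == 1)
		ret = (byte == 'x');

	munmap(map, (size_t)page);
out:
	close(fd);
	return ret;
}
```

With MAP_SHARED the write lands in the file's own page cache folio, which
is the case a dirty-tracking file system cares about. With MAP_PRIVATE the
write goes to a private anonymous copy and the file is left untouched.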

If you invoke QEMU manually, can you specify "share=on" for the 
memory-backend-file? I thought libvirt would always default to 
"share=on" for file mappings (everything else doesn't make much sense) 
... but you might have to specify
	<access mode="shared"/>
in addition to
	<source type="file"/>
Matthew Rosato May 3, 2023, 11:25 a.m. UTC | #3
On 5/3/23 3:08 AM, David Hildenbrand wrote:
> On 03.05.23 02:31, Matthew Rosato wrote:
>> On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
>>> [...]
>>
>> Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- This time both vfio_pin_pages_remote (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable are being allowed (e.g. returning positive pin count)
> 
> At least it's consistent now ;) And it might be working as expected ...
> 
> In v7:
> * pin_user_pages_fast() succeeded
> * vfio_pin_pages_remote() failed
> 
> But also in v7:
> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>   mappings
> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
> 
> In v8:
> * pin_user_pages_fast() succeeds
> * vfio_pin_pages_remote() succeeds
> 
> But also in v8:
> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>   mappings
> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
> 
> 
> I have to speculate, but ... could it be that you are using a private mapping?
> 
> In QEMU, unfortunately, the default for memory-backend-file is "share=off" (private) ... for memory-backend-memfd it is "share=on" (shared). The default is stupid ...
> 
> If you invoke QEMU manually, can you specify "share=on" for the memory-backend-file? I thought libvirt would always default to "share=on" for file mappings (everything else doesn't make much sense) ... but you might have to specify
>     <access mode="shared"/>
> in addition to
>     <source type="file"/>
> 

Ah, there we go. Yes, I was using the default of share=off. When I instead
specify share=on, the pins now fail in both cases.
David Hildenbrand May 3, 2023, 12:53 p.m. UTC | #4
On 03.05.23 13:25, Matthew Rosato wrote:
> On 5/3/23 3:08 AM, David Hildenbrand wrote:
>> On 03.05.23 02:31, Matthew Rosato wrote:
>>> On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
>>>> [...]
>>>
>>> Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- This time both vfio_pin_pages_remote (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable are being allowed (e.g. returning positive pin count)
>>
>> At least it's consistent now ;) And it might be working as expected ...
>>
>> In v7:
>> * pin_user_pages_fast() succeeded
>> * vfio_pin_pages_remote() failed
>>
>> But also in v7:
>> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>>    mappings
>> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
>>
>> In v8:
>> * pin_user_pages_fast() succeeds
>> * vfio_pin_pages_remote() succeeds
>>
>> But also in v8:
>> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>>    mappings
>> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
>>
>>
>> I have to speculate, but ... could it be that you are using a private mapping?
>>
>> In QEMU, unfortunately, the default for memory-backend-file is "share=off" (private) ... for memory-backend-memfd it is "share=on" (shared). The default is stupid ...
>>
>> If you invoke QEMU manually, can you specify "share=on" for the memory-backend-file? I thought libvirt would always default to "share=on" for file mappings (everything else doesn't make much sense) ... but you might have to specify
>>      <access mode="shared"/>
>> in addition to
>>      <source type="file"/>
>>
> 
> Ah, there we go.  Yes, I was using the default of share=off.  When I instead specify share=on, now the pins will fail in both cases.
> 

Out of curiosity, how does that manifest?

I assume the VM is successfully created and as Linux tries initializing 
and using the device, we get a bunch of errors inside the VM, correct?
Matthew Rosato May 3, 2023, 1:24 p.m. UTC | #5
On 5/3/23 8:53 AM, David Hildenbrand wrote:
> On 03.05.23 13:25, Matthew Rosato wrote:
>> On 5/3/23 3:08 AM, David Hildenbrand wrote:
>>> On 03.05.23 02:31, Matthew Rosato wrote:
>>>> On 5/2/23 6:51 PM, Lorenzo Stoakes wrote:
>>>>> [...]
>>>>
>>>> Tested again on s390 using QEMU with a memory backend file (on ext4) and vfio-pci -- This time both vfio_pin_pages_remote (which will call pin_user_pages_remote(flags | FOLL_LONGTERM)) and the pin_user_pages_fast(FOLL_WRITE | FOLL_LONGTERM) in kvm_s390_pci_aif_enable are being allowed (e.g. returning positive pin count)
>>>
>>> At least it's consistent now ;) And it might be working as expected ...
>>>
>>> In v7:
>>> * pin_user_pages_fast() succeeded
>>> * vfio_pin_pages_remote() failed
>>>
>>> But also in v7:
>>> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>>>    mappings
>>> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
>>>
>>> In v8:
>>> * pin_user_pages_fast() succeeds
>>> * vfio_pin_pages_remote() succeeds
>>>
>>> But also in v8:
>>> * GUP-fast allows pinning (anonymous) pages in MAP_PRIVATE file
>>>    mappings
>>> * Ordinary GUP allows pinning pages in MAP_PRIVATE file mappings
>>>
>>>
>>> I have to speculate, but ... could it be that you are using a private mapping?
>>>
>>> In QEMU, unfortunately, the default for memory-backend-file is "share=off" (private) ... for memory-backend-memfd it is "share=on" (shared). The default is stupid ...
>>>
>>> If you invoke QEMU manually, can you specify "share=on" for the memory-backend-file? I thought libvirt would always default to "share=on" for file mappings (everything else doesn't make much sense) ... but you might have to specify
>>>      <access mode="shared"/>
>>> in addition to
>>>      <source type="file"/>
>>>
>>
>> Ah, there we go.  Yes, I was using the default of share=off.  When I instead specify share=on, now the pins will fail in both cases.
>>
> 
> Out of curiosity, how does that manifest?
> 
> I assume the VM is successfully created and as Linux tries initializing and using the device, we get a bunch of errors inside the VM, correct?
> 

Yes, that's correct.

Which error comes first (an attempt at mapping something via type1 iommu or
an attempt to register AEN) depends on the device type and the order of
operations of the associated driver. But in either case, you're going to
see guest errors associated with that action. mlx5 and ism give up rather
quickly and just fail their probe. nvme in the guest is persistent: it
keeps re-attempting to set up AEN by issuing the associated instruction,
but the associated blockdev never shows up.