
[0/6] rust: page: Support borrowing `struct page` and physaddr conversion

Message ID: 20250202-rust-page-v1-0-e3170d7fe55e@asahilina.net

Message

Asahi Lina Feb. 2, 2025, 1:05 p.m. UTC
This series refactors the existing Page wrapper to support borrowing
`struct page` objects without ownership on the Rust side, and converting
page references to/from physical memory addresses.

The series overlaps with the earlier submission in [1] and follows a
different approach, based on the discussion that happened there.

The primary use case for this is implementing IOMMU-style page table
management in Rust. This allows drivers for IOMMUs and MMU-containing
SoC devices to be written in Rust (such as embedded GPUs). The intended
logic is similar to how ARM SMMU page tables are managed in the
drivers/iommu tree.

First, introduce a concept of Owned<T> and an Ownable trait. These are
similar to ARef<T> and AlwaysRefCounted, but are used for types which
are not ref counted but rather have a single intended owner.
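To make the concept concrete, here is a rough userspace sketch of the shape such a pair might take. Only the names Owned/Ownable (and the ARef/AlwaysRefCounted analogy) come from this series; the bodies below are illustrative guesses, with `Box` standing in for a kernel allocation, not the actual patch contents:

```rust
use core::ops::Deref;
use core::ptr::NonNull;

/// Types with a single intended owner, freed by a custom destructor rather
/// than a refcount (the non-refcounted analogue of `AlwaysRefCounted`).
///
/// # Safety
/// Implementers must ensure `release()` correctly frees the object and is
/// only ever called once per object.
pub unsafe trait Ownable {
    /// Frees the object; called exactly once, from `Owned<T>::drop()`.
    ///
    /// # Safety
    /// `this` must point to a live object that the caller owns.
    unsafe fn release(this: NonNull<Self>);
}

/// An owning pointer to an `Ownable` object (the unique-owner analogue of
/// `ARef<T>`).
pub struct Owned<T: Ownable> {
    ptr: NonNull<T>,
}

impl<T: Ownable> Owned<T> {
    /// Takes over ownership of the object behind `ptr`.
    ///
    /// # Safety
    /// `ptr` must point to a live object whose ownership transfers in.
    pub unsafe fn from_raw(ptr: NonNull<T>) -> Self {
        Self { ptr }
    }

    /// Relinquishes ownership without running the destructor.
    pub fn into_raw(me: Self) -> NonNull<T> {
        let ptr = me.ptr;
        core::mem::forget(me); // skip Drop: the caller now owns the object
        ptr
    }
}

impl<T: Ownable> Deref for Owned<T> {
    type Target = T;
    fn deref(&self) -> &T {
        // SAFETY: the object stays live while this `Owned` exists.
        unsafe { self.ptr.as_ref() }
    }
}

impl<T: Ownable> Drop for Owned<T> {
    fn drop(&mut self) {
        // SAFETY: sole ownership, so releasing exactly once here is correct.
        unsafe { T::release(self.ptr) };
    }
}

// Purely illustrative impl: a heap value whose only free path is `release`.
pub struct Thing(pub u32);

// SAFETY: `Thing`s are created via `Box` below and only freed in `release`.
unsafe impl Ownable for Thing {
    unsafe fn release(this: NonNull<Thing>) {
        drop(unsafe { Box::from_raw(this.as_ptr()) });
    }
}
```

The point of the pattern: `Owned<T>` guarantees `release()` runs exactly once, while `into_raw()`/`from_raw()` still allow ownership to be handed off deliberately, which is what the physaddr conversions later in the series rely on.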

Then, refactor the existing Page support to use the new mechanism. Pages
returned from the page allocator are not intended to be ref counted by
consumers (see previous discussion in [1]), so this keeps Rust's view of
page ownership as a simple "owned or not". Of course, this is still
composable as Arc<Owned<Page>> if Rust code needs to reference count its
own Page allocations for whatever reason.

Then, make some existing private methods public, since this is needed to
reasonably use allocated pages as IOMMU page tables.

Along the way we also add a small module representing core kernel
address types (PhysicalAddr, DmaAddr, ResourceSize, Pfn). In the future,
this might grow with helpers to make address math safer and more
Rust-like.
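Such a module could start out as little more than type aliases mirroring the C typedefs, plus trivial pfn math. A hedged sketch (the type names are the ones listed above; the definitions and helpers are guesses, with a 4 KiB page size assumed purely for the example):

```rust
// Hypothetical sketch of the address module; not the real bindings.

/// A physical memory address (C `phys_addr_t`).
pub type PhysicalAddr = u64;
/// A DMA (bus) address (C `dma_addr_t`).
pub type DmaAddr = u64;
/// A resource size, same width as a physical address (C `resource_size_t`).
pub type ResourceSize = u64;
/// A page frame number: a physical address shifted down by the page shift.
pub type Pfn = u64;

/// Assumed here purely for the example; the kernel's PAGE_SHIFT is
/// configuration-dependent.
pub const PAGE_SHIFT: u32 = 12;

/// The kind of "safer address math" helper such a module might grow.
pub fn phys_to_pfn(addr: PhysicalAddr) -> Pfn {
    addr >> PAGE_SHIFT
}

pub fn pfn_to_phys(pfn: Pfn) -> PhysicalAddr {
    pfn << PAGE_SHIFT
}
```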

Finally, add methods to:
- Get a page's physical address
- Convert an owned Page into its physical address
- Convert a physical address back to its owned Page
- Borrow a Page from a physical address, in both checked (verifying
  that a struct page exists and is accessible as regular RAM) and
  unchecked forms (useful when the user knows the physaddr is valid,
  for example because it was obtained from Page::into_phys()).
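As a mental model of these operations (emphatically not the real API: the sketch below replaces struct page and phys_addr_t with a heap allocation whose pointer value plays the role of the physical address, and `Box` plays the role of the owned page; only the method names and their safe/unsafe split follow the list above):

```rust
/// Userspace stand-in for `Page`. All bodies are invented for illustration.
pub struct Page {
    contents: [u8; 4096],
}

pub type PhysicalAddr = usize;

impl Page {
    /// Stand-in for the page allocator.
    pub fn alloc() -> Box<Page> {
        Box::new(Page { contents: [0u8; 4096] })
    }

    /// 1. Get the page's "physical address". Safe: no ownership change.
    pub fn phys(&self) -> PhysicalAddr {
        self as *const Page as usize
    }

    /// 2. Consume the owned page, leaving it allocated, return its address.
    pub fn into_phys(me: Box<Page>) -> PhysicalAddr {
        Box::into_raw(me) as usize
    }

    /// 3. Reverse of `into_phys()`.
    ///
    /// # Safety
    /// `addr` must have come from `into_phys()`; ownership transfers back.
    pub unsafe fn from_phys(addr: PhysicalAddr) -> Box<Page> {
        unsafe { Box::from_raw(addr as *mut Page) }
    }

    /// 4. Borrow without taking ownership (the "unchecked" flavour; the
    /// checked one would validate the address first).
    ///
    /// # Safety
    /// The caller must guarantee the page stays valid and unaliased for `'a`.
    pub unsafe fn borrow_phys<'a>(addr: PhysicalAddr) -> &'a Page {
        unsafe { &*(addr as *const Page) }
    }

    pub fn read_byte(&self, off: usize) -> u8 {
        self.contents[off]
    }
}
```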

Of course, all but the first two have to be `unsafe` by nature, but that
comes with the territory of writing low level memory management code.

These methods allow page table code to know the physical address of
pages (needed to build intermediate level PTEs) and to essentially
transfer ownership of the pages into the page table structure itself,
and back into Page objects when freeing page tables. Without that, the
code would have to keep track of page allocations in duplicate, once in
Rust code and once in the page table structure itself, which is less
desirable.
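That ownership transfer can be pictured in miniature: the page-table object keeps only raw addresses in its PTE slots and turns them back into owned allocations on teardown, so nothing is tracked twice. Everything below (PtPage, PageTable, the use of heap pointer values as "physical addresses") is invented for illustration:

```rust
/// One page-table level: 512 eight-byte PTEs (one 4 KiB page).
struct PtPage([u64; 512]);

/// A toy page table that owns its backing page *by address only*, the way
/// real page-table code owns pages through the PTEs that point at them.
struct PageTable {
    root_phys: usize, // "physical address" = heap pointer value here
}

impl PageTable {
    fn new() -> Self {
        // The `into_phys()` step: give up the Rust object, keep the address.
        let root: Box<PtPage> = Box::new(PtPage([0; 512]));
        Self {
            root_phys: Box::into_raw(root) as usize,
        }
    }

    /// The `borrow_phys()` step: temporary access via the stored address.
    fn set_pte(&mut self, idx: usize, val: u64) {
        // SAFETY: `root_phys` came from `Box::into_raw` and is still live.
        let root = unsafe { &mut *(self.root_phys as *mut PtPage) };
        root.0[idx] = val;
    }

    fn pte(&self, idx: usize) -> u64 {
        // SAFETY: as above.
        let root = unsafe { &*(self.root_phys as *const PtPage) };
        root.0[idx]
    }
}

impl Drop for PageTable {
    fn drop(&mut self) {
        // The `from_phys()` step: reclaim ownership so the page is freed
        // exactly once, with no duplicate bookkeeping on the Rust side.
        // SAFETY: this object is the sole owner of the address.
        drop(unsafe { Box::from_raw(self.root_phys as *mut PtPage) });
    }
}
```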

For Apple GPUs, the address space shared between firmware and the driver
is actually pre-allocated by the bootloader, with the top level page
table already pre-allocated, and the firmware owning some PTEs within it
while the kernel populates others. This cooperation works well when the
kernel can reference this top level page table by physical address. The
only thing the driver needs to ensure is that it never attempts to free
it in this case, nor the page tables corresponding to virtual address
ranges it doesn't own. Without the ability to just borrow the
pre-allocated top level page and access it, the driver would have to
special-case this and manually manage the top level PTEs outside the
main page table code, as well as introduce different page table
configurations with different numbers of levels so the kernel's view is
one level shallower.

The physical address borrow feature is also useful to generate virtual
address space dumps for crash dumps, including firmware pages. The
intent is that firmware pages are configured in the Device Tree as
reserved System RAM (without no-map), which creates struct page objects
for them and makes them available in the kernel's direct map. Then the
driver's page table code can walk the page tables and make a snapshot of
the entire address space, including firmware code and data pages,
pre-allocated shared segments, and driver-allocated objects (which are
GEM objects), again without special casing anything. The checks in
`Page::borrow_phys()` should ensure that the page is safe to access as
RAM, so this will skip MMIO pages and anything that wasn't declared to
the kernel in the DT.
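The checked borrow can be thought of as a validity test in front of the unchecked one. The toy model below loosely stands in for whatever pfn_valid()-style checks the real `Page::borrow_phys()` performs; the RAM range and all names here are made up:

```rust
type PhysicalAddr = u64;

/// RAM ranges the mock kernel knows about (as if parsed from the DT).
const RAM_RANGES: &[(PhysicalAddr, PhysicalAddr)] = &[
    (0x8000_0000, 0xc000_0000), // 1 GiB of System RAM
];

/// Does a page exist for `addr`, and is it regular RAM?
fn is_ram(addr: PhysicalAddr) -> bool {
    RAM_RANGES.iter().any(|&(start, end)| addr >= start && addr < end)
}

/// Checked flavour: `None` for MMIO or anything not declared as RAM, so a
/// page-table walk for a crash dump can simply skip those entries. (The
/// real method would return a borrowed page; returning the validated
/// address keeps this sketch self-contained.)
fn checked_borrow(addr: PhysicalAddr) -> Option<PhysicalAddr> {
    is_ram(addr).then_some(addr)
}
```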

Example usage:
https://github.com/AsahiLinux/linux/blob/gpu/rust-wip/drivers/gpu/drm/asahi/pgtable.rs

The last patch is a minor cleanup to the Page abstraction noticed while
preparing this series.

[1] https://lore.kernel.org/lkml/20241119112408.779243-1-abdiel.janulgue@gmail.com/T/#u

Signed-off-by: Asahi Lina <lina@asahilina.net>
---
Asahi Lina (6):
      rust: types: Add Ownable/Owned types
      rust: page: Convert to Ownable
      rust: page: Make with_page_mapped() and with_pointer_into_page() public
      rust: addr: Add a module to declare core address types
      rust: page: Add physical address conversion functions
      rust: page: Make Page::as_ptr() pub(crate)

 rust/helpers/page.c  |  26 ++++++++++++
 rust/kernel/addr.rs  |  15 +++++++
 rust/kernel/lib.rs   |   1 +
 rust/kernel/page.rs  | 101 ++++++++++++++++++++++++++++++++++++++--------
 rust/kernel/types.rs | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 236 insertions(+), 17 deletions(-)
---
base-commit: ffd294d346d185b70e28b1a28abe367bbfe53c04
change-id: 20250202-rust-page-80892069fc78

Cheers,
~~ Lina

Comments

Simona Vetter Feb. 3, 2025, 9:58 a.m. UTC | #1
On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
> This series refactors the existing Page wrapper to support borrowing
> `struct page` objects without ownership on the Rust side, and converting
> page references to/from physical memory addresses.
> 
> The series overlaps with the earlier submission in [1] and follows a
> different approach, based on the discussion that happened there.
> 
> The primary use case for this is implementing IOMMU-style page table
> management in Rust. This allows drivers for IOMMUs and MMU-containing
> SoC devices to be written in Rust (such as embedded GPUs). The intended
> logic is similar to how ARM SMMU page tables are managed in the
> drivers/iommu tree.
> 
> First, introduce a concept of Owned<T> and an Ownable trait. These are
> similar to ARef<T> and AlwaysRefCounted, but are used for types which
> are not ref counted but rather have a single intended owner.
> 
> Then, refactor the existing Page support to use the new mechanism. Pages
> returned from the page allocator are not intended to be ref counted by
> consumers (see previous discussion in [1]), so this keeps Rust's view of
> page ownership as a simple "owned or not". Of course, this is still
> composable as Arc<Owned<Page>> if Rust code needs to reference count its
> own Page allocations for whatever reason.

I think there's a bit of a potential mess here because the conversion to
folios isn't far enough yet that we can entirely ignore page refcounts and
just use folio refcounts. But I guess we can deal with that oddity if we
hit it (maybe folio conversion moves fast enough), since this only really
starts to become relevant for hmm/svm gpu stuff.

iow I think anticipating the future where struct page really doesn't have
a refcount is the right move. Aside from that it's really not a refcount
that works in the rust ARef sense, since struct page cannot disappear for
system memory, and for dev_pagemap memory it's an entirely different
reference you need (and then there's a few more special cases).

For dma/iommu stuff there's also a push to move towards pfn + metadata
model, so that p2pdma doesn't need struct page. But I haven't looked into
that much yet.

Cheers, Sima

Danilo Krummrich Feb. 3, 2025, 10:22 a.m. UTC | #2
Hi Lina,

On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
> This series refactors the existing Page wrapper to support borrowing
> `struct page` objects without ownership on the Rust side, and converting
> page references to/from physical memory addresses.

Thanks for picking this up!

As you know, this has been previously worked on by Abdiel. Kindly drop a note
if you intend to pick something up next time, such that we don't end up doing
duplicated work.

- Danilo
Asahi Lina Feb. 3, 2025, 2:32 p.m. UTC | #3
On 2/3/25 6:58 PM, Simona Vetter wrote:
> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>> This series refactors the existing Page wrapper to support borrowing
>> `struct page` objects without ownership on the Rust side, and converting
>> page references to/from physical memory addresses.
>>
>> The series overlaps with the earlier submission in [1] and follows a
>> different approach, based on the discussion that happened there.
>>
>> The primary use case for this is implementing IOMMU-style page table
>> management in Rust. This allows drivers for IOMMUs and MMU-containing
>> SoC devices to be written in Rust (such as embedded GPUs). The intended
>> logic is similar to how ARM SMMU page tables are managed in the
>> drivers/iommu tree.
>>
>> First, introduce a concept of Owned<T> and an Ownable trait. These are
>> similar to ARef<T> and AlwaysRefCounted, but are used for types which
>> are not ref counted but rather have a single intended owner.
>>
>> Then, refactor the existing Page support to use the new mechanism. Pages
>> returned from the page allocator are not intended to be ref counted by
>> consumers (see previous discussion in [1]), so this keeps Rust's view of
>> page ownership as a simple "owned or not". Of course, this is still
>> composable as Arc<Owned<Page>> if Rust code needs to reference count its
>> own Page allocations for whatever reason.
> 
> I think there's a bit of a potential mess here because the conversion to
> folios isn't far enough yet that we can entirely ignore page refcounts and
> just use folio refcounts. But I guess we can deal with that oddity if we
> hit it (maybe folio conversion moves fast enough), since this only really
> starts to become relevant for hmm/svm gpu stuff.
> 
> iow I think anticipating the future where struct page really doesn't have
> a refcount is the right move. Aside from that it's really not a refcount
> that works in the rust ARef sense, since struct page cannot disappear for
> system memory, and for dev_pagemap memory it's an entirely different
> reference you need (and then there's a few more special cases).

Right, as far as this abstraction is concerned, all that needs to hold
is that:

- alloc_pages() and __free_pages() work as intended, however that may
be, to reserve and return one page (for now, though I think extending
the Rust abstraction to handle higher-order folios is pretty easy, but
that can happen later).
- Whatever borrows pages knows what it's doing. In this case there's
only support for borrowing pages by physaddr, and it's only going to be
used in a driver for a platform without memory hot remove (so far) and
only for pages which have known usage (in principle) and are either
explicitly allocated or known pinned or reserved, so it's not a problem
right now. Future abstractions that return borrowed pages can do their
own locking/bookkeeping/whatever is necessary to keep it safe.

I would like to hear how memory hot-remove is supposed to work though,
to see if we should be doing something to make the abstraction safer
(though it's still unsafe and always will be). Is there a chance a
`struct page` could vanish out from under us under some conditions?

For dev_pagemap memory I imagine we'd have an entirely different
abstraction wrapping that, that can just return a borrowed &Page to give
the user access to page operations without going through Owned<Page>.

> For dma/iommu stuff there's also a push to move towards pfn + metadata
> model, so that p2pdma doesn't need struct page. But I haven't looked into
> that much yet.

Yeah, I don't know how that stuff works...

~~ Lina
Asahi Lina Feb. 3, 2025, 2:41 p.m. UTC | #4
On 2/3/25 7:22 PM, Danilo Krummrich wrote:
> Hi Lina,
> 
> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>> This series refactors the existing Page wrapper to support borrowing
>> `struct page` objects without ownership on the Rust side, and converting
>> page references to/from physical memory addresses.
> 
> Thanks for picking this up!
> 
> As you know, this has been previously worked on by Abdiel. Kindly drop a note
> if you intend to pick something up next time, such that we don't end up doing
> duplicated work.

Sorry! I wasn't sure if this was going to end up submitted over the
holidays so I wasn't in a rush, but I ended up switching gears in the
past couple of weeks and I really needed this feature now (for crashdump
support to debug a really annoying firmware crash).

I actually only just noticed that the previous v2 already had
Owned/Ownable... I only saw the v3 which didn't, so I didn't realize
there was already a version of that approach in the past.

~~ Lina
Zi Yan Feb. 3, 2025, 9:05 p.m. UTC | #5
On 3 Feb 2025, at 9:32, Asahi Lina wrote:

> On 2/3/25 6:58 PM, Simona Vetter wrote:
>>
>> I think there's a bit of a potential mess here because the conversion to
>> folios isn't far enough yet that we can entirely ignore page refcounts and
>> just use folio refcounts. But I guess we can deal with that oddity if we
>> hit it (maybe folio conversion moves fast enough), since this only really
>> starts to become relevant for hmm/svm gpu stuff.
>>
>> iow I think anticipating the future where struct page really doesn't have
>> a refcount is the right move. Aside from that it's really not a refcount
>> that works in the rust ARef sense, since struct page cannot disappear for
>> system memory, and for dev_pagemap memory it's an entirely different
>> reference you need (and then there's a few more special cases).
>
> Right, as far as this abstraction is concerned, all that needs to hold
> is that:
>
> - alloc_pages() and __free_pages() work as intended, however that may
> be, to reserve and return one page (for now, though I think extending
> the Rust abstraction to handle higher-order folios is pretty easy, but
> that can happen later).
> - Whatever borrows pages knows what it's doing. In this case there's
> only support for borrowing pages by physaddr, and it's only going to be
> used in a driver for a platform without memory hot remove (so far) and
> only for pages which have known usage (in principle) and are either
> explicitly allocated or known pinned or reserved, so it's not a problem
> right now. Future abstractions that return borrowed pages can do their
> own locking/bookkeeping/whatever is necessary to keep it safe.
>
> I would like to hear how memory hot-remove is supposed to work though,
> to see if we should be doing something to make the abstraction safer
> (though it's still unsafe and always will be). Is there a chance a
> `struct page` could vanish out from under us under some conditions?

Add DavidH and OscarS for memory hot-remove questions.

IIUC, struct page could be freed if a chunk of memory is hot-removed.

Another case where struct page can be freed is when the hugetlb vmemmap
optimization is used. Muchun (cc'd) is the maintainer of hugetlbfs.


--
Best Regards,
Yan, Zi
David Hildenbrand Feb. 4, 2025, 10:26 a.m. UTC | #6
On 03.02.25 22:05, Zi Yan wrote:
> On 3 Feb 2025, at 9:32, Asahi Lina wrote:
> 
>> I would like to hear how memory hot-remove is supposed to work though,
>> to see if we should be doing something to make the abstraction safer
>> (though it's still unsafe and always will be). Is there a chance a
>> `struct page` could vanish out from under us under some conditions?
> 
> Add DavidH and OscarS for memory hot-remove questions.
> 
> IIUC, struct page could be freed if a chunk of memory is hot-removed.

Right, but only after there are no users anymore (IOW, memory was freed 
back to the buddy). PFN walkers might still stumble over them, but I 
would not expect (or recommend) Rust to do that.

> 
> Another case struct page can be freed is when hugetlb vmemmap optimization
> is used. Muchun (cc'd) is the maintainer of hugetlbfs.

Here, the "struct page" remains valid though; it can still be accessed, 
although we disallow writes (which would be wrong).

If you only allocate a page and free it later, there is no need to worry 
about either on the Rust side.
David Hildenbrand Feb. 4, 2025, 10:33 a.m. UTC | #7
On 03.02.25 10:58, Simona Vetter wrote:
> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>> This series refactors the existing Page wrapper to support borrowing
>> `struct page` objects without ownership on the Rust side, and converting
>> page references to/from physical memory addresses.
>>
>> The series overlaps with the earlier submission in [1] and follows a
>> different approach, based on the discussion that happened there.
>>
>> The primary use case for this is implementing IOMMU-style page table
>> management in Rust. This allows drivers for IOMMUs and MMU-containing
>> SoC devices to be written in Rust (such as embedded GPUs). The intended
>> logic is similar to how ARM SMMU page tables are managed in the
>> drivers/iommu tree.
>>
>> First, introduce a concept of Owned<T> and an Ownable trait. These are
>> similar to ARef<T> and AlwaysRefCounted, but are used for types which
>> are not ref counted but rather have a single intended owner.
>>
>> Then, refactor the existing Page support to use the new mechanism. Pages
>> returned from the page allocator are not intended to be ref counted by
>> consumers (see previous discussion in [1]), so this keeps Rust's view of
>> page ownership as a simple "owned or not". Of course, this is still
>> composable as Arc<Owned<Page>> if Rust code needs to reference count its
>> own Page allocations for whatever reason.
> 
> I think there's a bit a potential mess here because the conversion to
> folios isn't far enough yet that we can entirely ignore page refcounts and
> just use folio refcounts. But I guess we can deal with that oddity if we
> hit it (maybe folio conversion moves fast enough), since this only really
> starts to become relevant for hmm/svm gpu stuff.

I'll note that in the future only selected things will be folios (e.g., 
pagecache, anonymous memory). Everything else will either get a separate 
memdesc (e.g., ptdesc), or work on bare pages.

Likely, when talking about page tables, "ptdesc" might be what you want 
to allocate here, and not "folios".

So in the future, this interface here will likely look quite a bit 
different.

(I know there was a discussion on that on RFC v3 with Willy, just 
stumbled over the usage of "folio" here)
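[Editorial sketch: the Owned<T>/Ownable split described in the cover letter can be modelled in plain userspace Rust. This is only an illustration of the ownership semantics being discussed; `FakePage`, the Box-based allocator, and all method bodies are stand-ins, not the series' actual rust/kernel/types.rs code.]

```rust
use core::ops::Deref;
use core::ptr::NonNull;

/// Types with a single owner, freed through a type-specific destructor
/// rather than a refcount (modelled on the series' `Ownable`).
pub unsafe trait Ownable {
    /// Release the object. Called exactly once, from `Owned::drop`.
    unsafe fn release(ptr: NonNull<Self>);
}

/// Owning handle to an `Ownable` object: the counterpart of `ARef<T>`
/// for `AlwaysRefCounted` types, but without any reference counting.
pub struct Owned<T: Ownable> {
    ptr: NonNull<T>,
}

impl<T: Ownable> Owned<T> {
    /// # Safety
    /// `ptr` must point to a live object the caller uniquely owns;
    /// ownership transfers to the returned handle.
    pub unsafe fn from_raw(ptr: NonNull<T>) -> Self {
        Self { ptr }
    }
}

impl<T: Ownable> Deref for Owned<T> {
    type Target = T;
    fn deref(&self) -> &T {
        // SAFETY: the object stays alive as long as the handle does.
        unsafe { self.ptr.as_ref() }
    }
}

impl<T: Ownable> Drop for Owned<T> {
    fn drop(&mut self) {
        // SAFETY: we hold the only owner, so releasing once is correct.
        unsafe { T::release(self.ptr) }
    }
}

/// Toy stand-in for the `Page` wrapper; the real type would call
/// alloc_pages()/__free_pages() instead of using a Box.
pub struct FakePage(pub u64);

unsafe impl Ownable for FakePage {
    unsafe fn release(ptr: NonNull<Self>) {
        // Reconstitute and drop the Box leaked in `alloc`.
        drop(unsafe { Box::from_raw(ptr.as_ptr()) });
    }
}

impl FakePage {
    pub fn alloc(val: u64) -> Owned<FakePage> {
        let ptr = NonNull::from(Box::leak(Box::new(FakePage(val))));
        // SAFETY: freshly allocated; we hold the only pointer.
        unsafe { Owned::from_raw(ptr) }
    }
}
```

The `Arc<Owned<Page>>` composition mentioned in the cover letter falls out for free here, since `Owned<T>` is an ordinary value type.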
Asahi Lina Feb. 4, 2025, 11:41 a.m. UTC | #8
On 2/4/25 7:26 PM, David Hildenbrand wrote:
> On 03.02.25 22:05, Zi Yan wrote:
>> On 3 Feb 2025, at 9:32, Asahi Lina wrote:
>>
>>> On 2/3/25 6:58 PM, Simona Vetter wrote:
>>>> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>>>>> This series refactors the existing Page wrapper to support borrowing
>>>>> `struct page` objects without ownership on the Rust side, and
>>>>> converting
>>>>> page references to/from physical memory addresses.
>>>>>
>>>>> The series overlaps with the earlier submission in [1] and follows a
>>>>> different approach, based on the discussion that happened there.
>>>>>
>>>>> The primary use case for this is implementing IOMMU-style page table
>>>>> management in Rust. This allows drivers for IOMMUs and MMU-containing
>>>>> SoC devices to be written in Rust (such as embedded GPUs). The
>>>>> intended
>>>>> logic is similar to how ARM SMMU page tables are managed in the
>>>>> drivers/iommu tree.
>>>>>
>>>>> First, introduce a concept of Owned<T> and an Ownable trait. These are
>>>>> similar to ARef<T> and AlwaysRefCounted, but are used for types which
>>>>> are not ref counted but rather have a single intended owner.
>>>>>
>>>>> Then, refactor the existing Page support to use the new mechanism.
>>>>> Pages
>>>>> returned from the page allocator are not intended to be ref counted by
>>>>> consumers (see previous discussion in [1]), so this keeps Rust's
>>>>> view of
>>>>> page ownership as a simple "owned or not". Of course, this is still
>>>>> composable as Arc<Owned<Page>> if Rust code needs to reference
>>>>> count its
>>>>> own Page allocations for whatever reason.
>>>>
>>>> I think there's a bit a potential mess here because the conversion to
>>>> folios isn't far enough yet that we can entirely ignore page
>>>> refcounts and
>>>> just use folio refcounts. But I guess we can deal with that oddity
>>>> if we
>>>> hit it (maybe folio conversion moves fast enough), since this only
>>>> really
>>>> starts to become relevant for hmm/svm gpu stuff.
>>>>
>>>> iow I think anticipating the future where struct page really doesn't
>>>> have
>>>> a refcount is the right move. Aside from that it's really not a
>>>> refcount
>>>> that works in the rust ARef sense, since struct page cannot
>>>> disappear for
>>>> system memory, and for dev_pagemap memory it's an entirely different
>>>> reference you need (and then there's a few more special cases).
>>>
>>> Right, as far as this abstraction is concerned, all that needs to hold
>>> is that:
>>>
>>> - alloc_pages() and __free_pages() work as intended, however that may
>>> be, to reserve and return one page (for now, though I think extending
>>> the Rust abstraction to handle higher-order folios is pretty easy, but
>>> that can happen later).
>>> - Whatever borrows pages knows what it's doing. In this case there's
>>> only support for borrowing pages by physaddr, and it's only going to be
>>> used in a driver for a platform without memory hot remove (so far) and
>>> only for pages which have known usage (in principle) and are either
>>> explicitly allocated or known pinned or reserved, so it's not a problem
>>> right now. Future abstractions that return borrowed pages can do their
>>> own locking/bookkeeping/whatever is necessary to keep it safe.
>>>
>>> I would like to hear how memory hot-remove is supposed to work though,
>>> to see if we should be doing something to make the abstraction safer
>>> (though it's still unsafe and always will be). Is there a chance a
>>> `struct page` could vanish out from under us under some conditions?
>>
>> Add DavidH and OscarS for memory hot-remove questions.
>>
>> IIUC, struct page could be freed if a chunk of memory is hot-removed.
> 
> Right, but only after there are no users anymore (IOW, memory was freed
> back to the buddy). PFN walkers might still stumble over them, but I
> would not expect (or recommend) rust to do that.

The physaddr to page function does look up pages by pfn, but it's
intended to be used by drivers that know what they're doing. There are
two variants of the API, one that is completely unchecked (a fast path
for cases where the driver definitely allocated these pages itself, for
example just grabbing the `struct page` back from a decoded PTE it
wrote), and one that has this check:

pfn_valid(pfn) && page_is_ram(pfn)

Which is intended as a safety net to allow drivers to look up
firmware-reserved pages too, and fail gracefully if the kernel doesn't
know about them (because they weren't declared in the
bootloader/firmware memory map correctly) or doesn't have them mapped in
the direct map (because they were declared no-map).
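[Editorial sketch: the two lookup variants described above can be modelled in userspace Rust. The `pfn_valid`/`page_is_ram` stubs and the `Pfn`/`NotRam` types are invented for illustration; the real code operates on `struct page` and returns kernel error codes.]

```rust
/// Stand-ins for the kernel's pfn_valid()/page_is_ram(); in this toy
/// model, pfns below 0x2000 have a memmap and 0x1000..0x2000 is RAM.
fn pfn_valid(pfn: u64) -> bool {
    pfn < 0x2000
}

fn page_is_ram(pfn: u64) -> bool {
    (0x1000..0x2000).contains(&pfn)
}

/// Toy borrowed-page token standing in for a borrowed `struct page`.
#[derive(Debug, PartialEq)]
pub struct Pfn(pub u64);

/// Illustrative error type.
#[derive(Debug, PartialEq)]
pub struct NotRam;

/// Checked variant: fail gracefully for pfns the kernel either does
/// not know about or does not map as ordinary RAM.
pub fn borrow_pfn_checked(pfn: u64) -> Result<Pfn, NotRam> {
    if pfn_valid(pfn) && page_is_ram(pfn) {
        Ok(Pfn(pfn)) // in the kernel: pfn_to_page()
    } else {
        Err(NotRam)
    }
}

/// Unchecked fast path: the caller promises the pfn came from a page
/// it allocated itself, e.g. decoded back out of a PTE it wrote.
///
/// # Safety
/// `pfn` must refer to a page the caller owns or knows to be pinned.
pub unsafe fn borrow_pfn_unchecked(pfn: u64) -> Pfn {
    Pfn(pfn)
}
```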

Is there anything else that can reasonably be done here to make the API
safer to call on an arbitrary pfn?

If the answer is "no" then that's fine. It's still an unsafe function
and we need to document in the safety section that it should only be
used for memory that is either known to be allocated and pinned and will
not be freed while the `struct page` is borrowed, or memory that is
reserved and not owned by the buddy allocator, so in practice correct
use would not be racy with memory hot-remove anyway.

This is already the case for the drm/asahi use case, where the pfns
looked up will only ever be one of:

- GEM objects that are mapped to the GPU and whose physical pages are
therefore pinned (and the VM is locked while this happens so the objects
cannot become unpinned out from under the running code),
- Raw pages allocated from the page allocator for use as GPU page tables,
- System memory that is marked reserved by the firmware/bootloader,
- (Potentially) invalid PFNs that aren't part of the System RAM region
at all and don't have a struct page to begin with, which we check for,
so the API returns an error. This would only happen if the bootloader
didn't declare some used firmware ranges at all, so Linux doesn't know
about them.

> 
>>
>> Another case struct page can be freed is when hugetlb vmemmap
>> optimization
>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
> 
> Here, the "struct page" remains valid though; it can still be accessed,
> although we disallow writes (which would be wrong).
> 
> If you only allocate a page and free it later, there is no need to worry
> about either on the rust side.

This is what the safe API does. (Also the unsafe physaddr APIs if all
you ever do is convert an allocated page to a physaddr and back, which
is the only thing the GPU page table code does during normal use. The
walking leaf PFNs story is only for GPU device coredumps when the
firmware crashes.)

~~ Lina
David Hildenbrand Feb. 4, 2025, 11:59 a.m. UTC | #9
>>> Add DavidH and OscarS for memory hot-remove questions.
>>>
>>> IIUC, struct page could be freed if a chunk of memory is hot-removed.
>>
>> Right, but only after there are no users anymore (IOW, memory was freed
>> back to the buddy). PFN walkers might still stumble over them, but I
>> would not expect (or recommend) rust to do that.
> 
> The physaddr to page function does look up pages by pfn, but it's
> intended to be used by drivers that know what they're doing. There are
> two variants of the API, one that is completely unchecked (a fast path
> for cases where the driver definitely allocated these pages itself, for
> example just grabbing the `struct page` back from a decoded PTE it
> wrote), and one that has this check:
> 
> pfn_valid(pfn) && page_is_ram(pfn)
> 
> Which is intended as a safety net to allow drivers to look up
> firmware-reserved pages too, and fail gracefully if the kernel doesn't
> know about them (because they weren't declared in the
> bootloader/firmware memory map correctly) or doesn't have them mapped in
> the direct map (because they were declared no-map).
> 
> Is there anything else that can reasonably be done here to make the API
> safer to call on an arbitrary pfn?

In PFN walkers we use pfn_to_online_page() to make sure that (1) the 
memmap really exists; and (2) that it has meaning (e.g., was actually 
initialized).

It can still race with memory offlining, and it refuses ZONE_DEVICE 
pages. For the latter, we have a different way to check validity. See 
memory_failure() that first calls pfn_to_online_page() to then check 
get_dev_pagemap().
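
[Editorial sketch: the check ordering described here, try the online memmap first and only then consider ZONE_DEVICE, can be modelled like this. The helper bodies and pfn ranges are invented for the example.]

```rust
#[derive(Debug, PartialEq)]
enum PfnKind {
    Online,     // ordinary memory with an initialized memmap
    ZoneDevice, // dev_pagemap-backed memory
    Invalid,    // no usable memmap
}

/// Stand-in for pfn_to_online_page(); the range is made up.
fn pfn_to_online_page(pfn: u64) -> Option<u64> {
    (pfn < 0x1000).then_some(pfn)
}

/// Stand-in for get_dev_pagemap(); the range is made up.
fn get_dev_pagemap(pfn: u64) -> Option<u64> {
    (0x8000..0x9000).contains(&pfn).then_some(pfn)
}

/// Classify a pfn the way memory_failure() does: consult the online
/// memmap first, and only then check for a dev_pagemap.
fn classify_pfn(pfn: u64) -> PfnKind {
    if pfn_to_online_page(pfn).is_some() {
        PfnKind::Online
    } else if get_dev_pagemap(pfn).is_some() {
        PfnKind::ZoneDevice
    } else {
        PfnKind::Invalid
    }
}
```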

> 
> If the answer is "no" then that's fine. It's still an unsafe function
> and we need to document in the safety section that it should only be
> used for memory that is either known to be allocated and pinned and will
> not be freed while the `struct page` is borrowed, or memory that is
> reserved and not owned by the buddy allocator, so in practice correct
> use would not be racy with memory hot-remove anyway.
> 
> This is already the case for the drm/asahi use case, where the pfns
> looked up will only ever be one of:
> 
> - GEM objects that are mapped to the GPU and whose physical pages are
> therefore pinned (and the VM is locked while this happens so the objects
> cannot become unpinned out from under the running code),

How exactly are these pages pinned/obtained?

> - Raw pages allocated from the page allocator for use as GPU page tables,

That makes sense.

> - System memory that is marked reserved by the firmware/bootloader,

E.g., in arch/x86/mm/ioremap.c:__ioremap_check_ram() we refuse anything 
that has a valid memmap and is *not* marked as PageReserved, to prevent 
remapping arbitrary *real* RAM.

Is that case here similar?

> - (Potentially) invalid PFNs that aren't part of the System RAM region
> at all and don't have a struct page to begin with, which we check for,
> so the API returns an error. This would only happen if the bootloader
> didn't declare some used firmware ranges at all, so Linux doesn't know
> about them.
> 
>>
>>>
>>> Another case struct page can be freed is when hugetlb vmemmap
>>> optimization
>>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
>>
>> Here, the "struct page" remains valid though; it can still be accessed,
>> although we disallow writes (which would be wrong).
>>
>> If you only allocate a page and free it later, there is no need to worry
>> about either on the rust side.
> 
> This is what the safe API does. (Also the unsafe physaddr APIs if all
> you ever do is convert an allocated page to a physaddr and back, which
> is the only thing the GPU page table code does during normal use. The
> walking leaf PFNs story is only for GPU device coredumps when the
> firmware crashes.)

I would hope that we can lock down this interface as much as possible.

Ideally, we would never go from pfn->page, unless

(a) we remember somehow that we came from page->pfn. E.g., we allocated
     these pages or someone else provided us with these pages. The memmap
     cannot go away. I know it's hard.
(b) the pages are flagged as being special, similar to
     __ioremap_check_ram().
Asahi Lina Feb. 4, 2025, 1:05 p.m. UTC | #10
On 2/4/25 8:59 PM, David Hildenbrand wrote:
>>>> Add DavidH and OscarS for memory hot-remove questions.
>>>>
>>>> IIUC, struct page could be freed if a chunk of memory is hot-removed.
>>>
>>> Right, but only after there are no users anymore (IOW, memory was freed
>>> back to the buddy). PFN walkers might still stumble over them, but I
>>> would not expect (or recommend) rust to do that.
>>
>> The physaddr to page function does look up pages by pfn, but it's
>> intended to be used by drivers that know what they're doing. There are
>> two variants of the API, one that is completely unchecked (a fast path
>> for cases where the driver definitely allocated these pages itself, for
>> example just grabbing the `struct page` back from a decoded PTE it
>> wrote), and one that has this check:
>>
>> pfn_valid(pfn) && page_is_ram(pfn)
>>
>> Which is intended as a safety net to allow drivers to look up
>> firmware-reserved pages too, and fail gracefully if the kernel doesn't
>> know about them (because they weren't declared in the
>> bootloader/firmware memory map correctly) or doesn't have them mapped in
>> the direct map (because they were declared no-map).
>>
>> Is there anything else that can reasonably be done here to make the API
>> safer to call on an arbitrary pfn?
> 
> In PFN walkers we use pfn_to_online_page() to make sure that (1) the
> memmap really exists; and (2) that it has meaning (e.g., was actually
> initialized).
> 
> It can still race with memory offlining, and it refuses ZONE_DEVICE
> pages. For the latter, we have a different way to check validity. See
> memory_failure() that first calls pfn_to_online_page() to then check
> get_dev_pagemap().

I'll give it a shot with these functions. If they work for my use case,
then it's good to have extra checks and I'll add them for v2. Thanks!

> 
>>
>> If the answer is "no" then that's fine. It's still an unsafe function
>> and we need to document in the safety section that it should only be
>> used for memory that is either known to be allocated and pinned and will
>> not be freed while the `struct page` is borrowed, or memory that is
>> reserved and not owned by the buddy allocator, so in practice correct
>> use would not be racy with memory hot-remove anyway.
>>
>> This is already the case for the drm/asahi use case, where the pfns
>> looked up will only ever be one of:
>>
>> - GEM objects that are mapped to the GPU and whose physical pages are
>> therefore pinned (and the VM is locked while this happens so the objects
>> cannot become unpinned out from under the running code),
> 
> How exactly are these pages pinned/obtained?

Under the hood it's shmem. For pinning, it winds up at
`drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
a mapping set as unevictable.

I'm not very familiar with the innards of that codepath, but it's
definitely an invariant that GEM objects have to be pinned while they
are mapped in GPU page tables (otherwise the GPU would end up accessing
freed memory).

Since the code that walks the PT to dump pages is part of the same PT
object and takes a mutable reference, the Rust guarantees mean it's
impossible for the PT to be concurrently mutated or anything like that.
So if one of these objects *were* unpinned/freed somehow while the dump
code is running, that would be a major bug somewhere else, since there
would be dangling PTEs left over.

In practice, there's a big lock around each PT/VM at a higher level of
the driver, so any attempts to unmap/free any of those objects will be
stuck waiting for the lock on the VM they are mapped into.
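
[Editorial sketch: a minimal model of that guarantee. The dump walker takes the page table by `&mut` reference behind the VM lock, so the borrow checker plus the mutex rule out concurrent unmap/free. The type names are hypothetical, not the driver's.]

```rust
use std::sync::Mutex;

struct PageTable {
    entries: Vec<u64>,
}

impl PageTable {
    /// Taking `&mut self` lets the borrow checker guarantee nothing
    /// else can mutate (or even read) the table while the dump runs.
    fn dump(&mut self) -> Vec<u64> {
        self.entries.clone()
    }
}

/// The "big lock around each PT/VM": every unmap/free and every dump
/// goes through the same mutex, so they serialize.
struct Vm {
    pt: Mutex<PageTable>,
}

fn dump_vm(vm: &Vm) -> Vec<u64> {
    let mut pt = vm.pt.lock().expect("VM lock poisoned");
    pt.dump()
}
```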

> 
>> - Raw pages allocated from the page allocator for use as GPU page tables,
> 
> That makes sense.
> 
>> - System memory that is marked reserved by the firmware/bootloader,
> 
> E.g., in arch/x86/mm/ioremap.c:__ioremap_check_ram() we refuse anything
> that has a valid memmap and is *not* marked as PageReserved, to prevent
> remapping arbitrary *real* RAM.
> 
> Is that case here similar?

I don't have an explicit check for that here but yes, the pages wind up
marked PageReserved. This includes both no-map ranges and map ranges.
The no-map ranges also have a valid struct page if within the declared
RAM range but they would oops if we try to access the contents via
direct map, so the page_is_ram() check is there to reject those (and
MMIO and anything else that isn't normally mapped as RAM even if it
winds up with a struct page).

>> - (Potentially) invalid PFNs that aren't part of the System RAM region
>> at all and don't have a struct page to begin with, which we check for,
>> so the API returns an error. This would only happen if the bootloader
>> didn't declare some used firmware ranges at all, so Linux doesn't know
>> about them.
>>
>>>
>>>>
>>>> Another case struct page can be freed is when hugetlb vmemmap
>>>> optimization
>>>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
>>>
>>> Here, the "struct page" remains valid though; it can still be accessed,
>>> although we disallow writes (which would be wrong).
>>>
>>> If you only allocate a page and free it later, there is no need to worry
>>> about either on the rust side.
>>
>> This is what the safe API does. (Also the unsafe physaddr APIs if all
>> you ever do is convert an allocated page to a physaddr and back, which
>> is the only thing the GPU page table code does during normal use. The
>> walking leaf PFNs story is only for GPU device coredumps when the
>> firmware crashes.)
> 
> I would hope that we can lock down this interface as much as possible.

Right, that's why the safe API never does any of the weird pfn->page
stuff. Rust driver code has to use unsafe {} to access the raw pfn->page
interface, which requires a // SAFETY comment explaining why what it's
doing is safe, and then we need to document in the function signature
what the safety requirements are so those comments can be reviewed.
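
[Editorial sketch: a toy example of the pattern being described, an `unsafe fn` whose doc comment carries the safety contract and whose caller justifies it with a `// SAFETY` comment. The function and its backing store are invented for illustration.]

```rust
/// Borrow a "page" by physical address (illustrative only; not the
/// series' real signature).
///
/// # Safety
///
/// The caller must ensure the memory behind `paddr` stays allocated,
/// or is firmware-reserved, for the lifetime of the returned borrow.
unsafe fn borrow_page<'a>(paddr: usize) -> &'a u32 {
    // Toy backing store standing in for the memmap.
    static PAGES: [u32; 4] = [10, 20, 30, 40];
    &PAGES[paddr / 0x1000]
}

fn caller() -> u32 {
    // SAFETY: the entry is statically allocated and never freed, so
    // the borrow cannot dangle. This comment is the reviewable
    // justification that must sit next to every unsafe use.
    let page = unsafe { borrow_page(2 * 0x1000) };
    *page
}
```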

> Ideally, we would never go from pfn->page, unless
> 
> (a) we remember somehow that we came from page->pfn. E.g., we allocated
>     these pages or someone else provided us with these pages. The memmap
>     cannot go away. I know it's hard.

This is the common case for the page tables. 99% of the time this is
what the driver will be doing, with a single exception (the root page
table of the firmware/privileged VM is a system reserved memory region,
and falls under (b). It's one single page globally in the system.).

The driver actually uses the completely unchecked interface in this
case, since it knows the pfns are definitely OK. I do a single check
with the checked interface at probe time for that one special-case pfn
so it can fail gracefully instead of oops if the DT config is
unusable/wrong.

> (b) the pages are flagged as being special, similar to
>     __ioremap_check_ram().

This only ever happens during firmware crash dumps (plus the one
exception above).

The missing (c) case is the kernel/firmware shared memory GEM objects
during crash dumps. But I really need those to diagnose firmware
crashes. Of course, I could dump them separately through other APIs in
principle, but that would complicate the crashdump code quite a bit
since I'd have to go through all the kernel GPU memory allocators and
dig out all their backing GEM objects and copy the memory through their
vmap (they are all vmapped, which is yet another reason in practice the
pages are pinned) and merge it into the coredump file. I also wouldn't
have easy direct access to the matching GPU PTEs if I do that (I store
the PTE permission/caching bits in the coredump file, since those are
actually kind of critical to diagnose exactly what happened, as caching
issues are one major cause of firmware problems). Since I need the page
table walker code to grab the firmware pages anyway, I hope I can avoid
having to go through a completely different codepath for the kernel GEM
objects...

~~ Lina
David Hildenbrand Feb. 4, 2025, 2:38 p.m. UTC | #11
>> It can still race with memory offlining, and it refuses ZONE_DEVICE
>> pages. For the latter, we have a different way to check validity. See
>> memory_failure() that first calls pfn_to_online_page() to then check
>> get_dev_pagemap().
> 
> I'll give it a shot with these functions. If they work for my use case,
> then it's good to have extra checks and I'll add them for v2. Thanks!

Let me know if you run into any issues.

> 
>>
>>>
>>> If the answer is "no" then that's fine. It's still an unsafe function
>>> and we need to document in the safety section that it should only be
>>> used for memory that is either known to be allocated and pinned and will
>>> not be freed while the `struct page` is borrowed, or memory that is
>>> reserved and not owned by the buddy allocator, so in practice correct
>>> use would not be racy with memory hot-remove anyway.
>>>
>>> This is already the case for the drm/asahi use case, where the pfns
>>> looked up will only ever be one of:
>>>
>>> - GEM objects that are mapped to the GPU and whose physical pages are
>>> therefore pinned (and the VM is locked while this happens so the objects
>>> cannot become unpinned out from under the running code),
>>
>> How exactly are these pages pinned/obtained?
> 
> Under the hood it's shmem. For pinning, it winds up at
> `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
> a mapping set as unevictable.

Thanks. So we grab another folio reference via 
shmem_read_folio_gfp()->shmem_get_folio_gfp().

Hm, I wonder if we might end up holding folios residing in 
ZONE_MOVABLE/MIGRATE_CMA longer than we should.

Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes 
sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.

But that's a different discussion, just pointing it out, maybe I'm 
missing something :)

> 
> I'm not very familiar with the innards of that codepath, but it's
> definitely an invariant that GEM objects have to be pinned while they
> are mapped in GPU page tables (otherwise the GPU would end up accessing
> freed memory).

Right, there must be a raised reference.

> 
> Since the code that walks the PT to dump pages is part of the same PT
> object and takes a mutable reference, the Rust guarantees mean it's
> impossible for the PT to be concurrently mutated or anything like that.
> So if one of these objects *were* unpinned/freed somehow while the dump
> code is running, that would be a major bug somewhere else, since there
> would be dangling PTEs left over.
> 
> In practice, there's a big lock around each PT/VM at a higher level of
> the driver, so any attempts to unmap/free any of those objects will be
> stuck waiting for the lock on the VM they are mapped into.

Understood, thanks.

[...]

>>>>>
>>>>> Another case struct page can be freed is when hugetlb vmemmap
>>>>> optimization
>>>>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
>>>>
>>>> Here, the "struct page" remains valid though; it can still be accessed,
>>>> although we disallow writes (which would be wrong).
>>>>
>>>> If you only allocate a page and free it later, there is no need to worry
>>>> about either on the rust side.
>>>
>>> This is what the safe API does. (Also the unsafe physaddr APIs if all
>>> you ever do is convert an allocated page to a physaddr and back, which
>>> is the only thing the GPU page table code does during normal use. The
>>> walking leaf PFNs story is only for GPU device coredumps when the
>>> firmware crashes.)
>>
>> I would hope that we can lock down this interface as much as possible.
> 
> Right, that's why the safe API never does any of the weird pfn->page
> stuff. Rust driver code has to use unsafe {} to access the raw pfn->page
> interface, which requires a // SAFETY comment explaining why what it's
> doing is safe, and then we need to document in the function signature
> what the safety requirements are so those comments can be reviewed.
> 
>> Ideally, we would never go from pfn->page, unless
>>
>> (a) we remember somehow that we came from page->pfn. E.g., we allocated
>>      these pages or someone else provided us with these pages. The memmap
>>      cannot go away. I know it's hard.
> 
> This is the common case for the page tables. 99% of the time this is
> what the driver will be doing, with a single exception (the root page
> table of the firmware/privileged VM is a system reserved memory region,
> and falls under (b). It's one single page globally in the system.).

Makes sense.

> 
> The driver actually uses the completely unchecked interface in this
> case, since it knows the pfns are definitely OK. I do a single check
> with the checked interface at probe time for that one special-case pfn
> so it can fail gracefully instead of oops if the DT config is
> unusable/wrong.
> 
>> (b) the pages are flagged as being special, similar to
>>      __ioremap_check_ram().
> 
> This only ever happens during firmware crash dumps (plus the one
> exception above).
> 
> The missing (c) case is the kernel/firmware shared memory GEM objects
> during crash dumps.

If it's only for crash dumps etc., that makes the whole thing a lot less 
scary. Maybe this could be opt-in somewhere, to "unlock" this interface? 
Just an idea.

> But I really need those to diagnose firmware
> crashes. Of course, I could dump them separately through other APIs in
> principle, but that would complicate the crashdump code quite a bit
> since I'd have to go through all the kernel GPU memory allocators and
> dig out all their backing GEM objects and copy the memory through their
> vmap (they are all vmapped, which is yet another reason in practice the
> pages are pinned) and merge it into the coredump file. I also wouldn't
> have easy direct access to the matching GPU PTEs if I do that (I store
> the PTE permission/caching bits in the coredump file, since those are
> actually kind of critical to diagnose exactly what happened, as caching
> issues are one major cause of firmware problems). Since I need the page
> table walker code to grab the firmware pages anyway, I hope I can avoid
> having to go through a completely different codepath for the kernel GEM
> objects...

Makes sense.
Asahi Lina Feb. 4, 2025, 5:59 p.m. UTC | #12
On 2/4/25 11:38 PM, David Hildenbrand wrote:
>>>> If the answer is "no" then that's fine. It's still an unsafe function
>>>> and we need to document in the safety section that it should only be
>>>> used for memory that is either known to be allocated and pinned and
>>>> will
>>>> not be freed while the `struct page` is borrowed, or memory that is
>>>> reserved and not owned by the buddy allocator, so in practice correct
>>>> use would not be racy with memory hot-remove anyway.
>>>>
>>>> This is already the case for the drm/asahi use case, where the pfns
>>>> looked up will only ever be one of:
>>>>
>>>> - GEM objects that are mapped to the GPU and whose physical pages are
>>>> therefore pinned (and the VM is locked while this happens so the
>>>> objects
>>>> cannot become unpinned out from under the running code),
>>>
>>> How exactly are these pages pinned/obtained?
>>
>> Under the hood it's shmem. For pinning, it winds up at
>> `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
>> a mapping set as unevictable.
> 
> Thanks. So we grab another folio reference via
> shmem_read_folio_gfp()->shmem_get_folio_gfp().
> 
> Hm, I wonder if we might end up holding folios residing in ZONE_MOVABLE/
> MIGRATE_CMA longer than we should.
> 
> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes
> sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
> 
> But that's a different discussion, just pointing it out, maybe I'm
> missing something :)

I think this is a little over my head. Though I only just realized that
we seem to be keeping the GEM objects pinned forever, even after unmap,
in the drm-shmem core API (I see no drm-shmem entry point that would
allow the sgt to be freed and its corresponding pages ref to be dropped,
other than a purge of purgeable objects or final destruction of the
object). I'll poke around since this feels wrong, I thought we were
supposed to be able to have shrinker support for swapping out whole GPU
VMs in the modern GPU MM model, but I guess there's no implementation of
that for gem-shmem drivers yet...?

That's a discussion for the DRM side though.

> 
>>
>> I'm not very familiar with the innards of that codepath, but it's
>> definitely an invariant that GEM objects have to be pinned while they
>> are mapped in GPU page tables (otherwise the GPU would end up accessing
>> freed memory).
> 
> Right, there must be a raised reference.
> 

[...]

>>>>>> Another case struct page can be freed is when hugetlb vmemmap
>>>>>> optimization
>>>>>> is used. Muchun (cc'd) is the maintainer of hugetlbfs.
>>>>>
>>>>> Here, the "struct page" remains valid though; it can still be
>>>>> accessed,
>>>>> although we disallow writes (which would be wrong).
>>>>>
>>>>> If you only allocate a page and free it later, there is no need to
>>>>> worry
>>>>> about either on the rust side.
>>>>
>>>> This is what the safe API does. (Also the unsafe physaddr APIs if all
>>>> you ever do is convert an allocated page to a physaddr and back, which
>>>> is the only thing the GPU page table code does during normal use. The
>>>> walking leaf PFNs story is only for GPU device coredumps when the
>>>> firmware crashes.)
>>>
>>> I would hope that we can lock down this interface as much as possible.
>>
>> Right, that's why the safe API never does any of the weird pfn->page
>> stuff. Rust driver code has to use unsafe {} to access the raw pfn->page
>> interface, which requires a // SAFETY comment explaining why what it's
>> doing is safe, and then we need to document in the function signature
>> what the safety requirements are so those comments can be reviewed.
>>
>>> Ideally, we would never go from pfn->page, unless
>>>
>>> (a) we remember somehow that we came from page->pfn. E.g., we allocated
>>>      these pages or someone else provided us with these pages. The
>>> memmap
>>>      cannot go away. I know it's hard.
>>
>> This is the common case for the page tables. 99% of the time this is
>> what the driver will be doing, with a single exception (the root page
>> table of the firmware/privileged VM is a system reserved memory region,
>> and falls under (b). It's one single page globally in the system.).
> 
> Makes sense.
> 
>>
>> The driver actually uses the completely unchecked interface in this
>> case, since it knows the pfns are definitely OK. I do a single check
>> with the checked interface at probe time for that one special-case pfn
>> so it can fail gracefully instead of oops if the DT config is
>> unusable/wrong.
>>
>>> (b) the pages are flagged as being special, similar to
>>>      __ioremap_check_ram().
>>
>> This only ever happens during firmware crash dumps (plus the one
>> exception above).
>>
>> The missing (c) case is the kernel/firmware shared memory GEM objects
>> during crash dumps.
> 
> If it's only for crash dumps etc. that might even be opt-in, it makes
> the whole thing a lot less scary. Maybe this could be opt-in somewhere,
> to "unlock" this interface? Just an idea.

Just to make sure we're on the same page, I don't think there's anything
to unlock in the Rust abstraction side (this series). At the end of the
day, if nothing else, the unchecked interface (which the regular
non-crash page table management code uses for performance) will let you
use any pfn you want; it's up to documentation and human review to
specify how it should be used by drivers. What Rust gives us here is the
mandatory `unsafe {}`, so any attempts to use this API will necessarily
stick out during review as potentially dangerous code that needs extra
scrutiny.
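To illustrate the pattern (with an invented `borrow_pfn` stand-in, not the actual API in this series): the lookup is an `unsafe fn`, so every call site must carry a `// SAFETY:` comment that reviewers can check against the documented requirements.

```rust
/// Stand-in for a borrowed `struct page` reference; purely illustrative.
struct Page(u64);

/// # Safety
///
/// The caller must ensure `pfn` refers to memory that is allocated and
/// pinned (or reserved and outside the buddy allocator) for as long as
/// the returned `Page` is used.
unsafe fn borrow_pfn(pfn: u64) -> Page {
    // A real implementation would translate pfn -> struct page here.
    Page(pfn)
}

fn main() {
    // SAFETY: in a real driver this pfn would come from a pinned GEM
    // object; here it is just an illustrative constant.
    let page = unsafe { borrow_pfn(0x1234) };
    assert_eq!(page.0, 0x1234);
}
```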

For the client driver itself, I could gate the devcoredump stuff behind
a module parameter or something... but I don't think it's really worth
it. We don't have a way to reboot the firmware or recover from this
condition (platform limitations), so end users are stuck rebooting to
get back a usable machine anyway. If something goes wrong in the
crashdump code and the machine oopses or locks up worse... it doesn't
really make much of a difference for normal end users. I don't think
this will ever really happen given the constraints I described, but if
somehow it does (some other bug somewhere?), well... the machine was
already in an unrecoverable state anyway.

It would be nice to have userspace tooling deployed by default that
saves off the devcoredump somewhere, so we can have a chance at
debugging hard-to-hit firmware crashes... if it's opt-in, it would only
really be useful for developers and CI machines.

There *is* a system-global devcoredump disable, but it's not exposed
outside of the devcoredump core. It just lets coredumps happen and then
throws away the result. It might be worth sending out a patch to expose
that to drivers, so they can skip the whole coredump generation
machinery if it's unnecessary.

>> But I really need those to diagnose firmware
>> crashes. Of course, I could dump them separately through other APIs in
>> principle, but that would complicate the crashdump code quite a bit
>> since I'd have to go through all the kernel GPU memory allocators and
>> dig out all their backing GEM objects and copy the memory through their
>> vmap (they are all vmapped, which is yet another reason in practice the
>> pages are pinned) and merge it into the coredump file. I also wouldn't
>> have easy direct access to the matching GPU PTEs if I do that (I store
>> the PTE permission/caching bits in the coredump file, since those are
>> actually kind of critical to diagnose exactly what happened, as caching
>> issues are one major cause of firmware problems). Since I need the page
>> table walker code to grab the firmware pages anyway, I hope I can avoid
>> having to go through a completely different codepath for the kernel GEM
>> objects...
> 
> Makes sense.
> 

~~ Lina
Jason Gunthorpe Feb. 4, 2025, 6:39 p.m. UTC | #13
On Tue, Feb 04, 2025 at 11:33:24AM +0100, David Hildenbrand wrote:
> On 03.02.25 10:58, Simona Vetter wrote:
> > On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
> > > This series refactors the existing Page wrapper to support borrowing
> > > `struct page` objects without ownership on the Rust side, and converting
> > > page references to/from physical memory addresses.
> > > 
> > > The series overlaps with the earlier submission in [1] and follows a
> > > different approach, based on the discussion that happened there.
> > > 
> > > The primary use case for this is implementing IOMMU-style page table
> > > management in Rust. This allows drivers for IOMMUs and MMU-containing
> > > SoC devices to be written in Rust (such as embedded GPUs). The intended
> > > logic is similar to how ARM SMMU page tables are managed in the
> > > drivers/iommu tree.
> > > 
> > > First, introduce a concept of Owned<T> and an Ownable trait. These are
> > > similar to ARef<T> and AlwaysRefCounted, but are used for types which
> > > are not ref counted but rather have a single intended owner.
> > > 
> > > Then, refactor the existing Page support to use the new mechanism. Pages
> > > returned from the page allocator are not intended to be ref counted by
> > > consumers (see previous discussion in [1]), so this keeps Rust's view of
> > > page ownership as a simple "owned or not". Of course, this is still
> > > composable as Arc<Owned<Page>> if Rust code needs to reference count its
> > > own Page allocations for whatever reason.
> > 
> > I think there's a bit of a potential mess here because the conversion to
> > folios isn't far enough yet that we can entirely ignore page refcounts and
> > just use folio refcounts. But I guess we can deal with that oddity if we
> > hit it (maybe folio conversion moves fast enough), since this only really
> > starts to become relevant for hmm/svm gpu stuff.
> 
> I'll note that in the future only selected things will be folios (e.g.,
> pagecache, anonymous memory). Everything else will either get a separate
> memdesc (e.g., ptdesc), or work on bare pages.
> 
> Likely, when talking about page tables, "ptdesc" might be what you want to
> allocate here, and not "folios".

I just posted a series to clean up the iommu code that this is
cribbing the interface from: add an ioptdesc, remove all the struct
page from the API and so forth:

https://lore.kernel.org/all/0-v1-416f64558c7c+2a5-iommu_pages_jgg@nvidia.com/

I strongly suggest that rust not expose these old schemes that will
need another cleanup like the above. Page tables just need an
allocator using simple void *, kmalloc is good enough for simple
cases.

Drivers should not be exposing or touching struct page just to
implement a page table.

Jason
Asahi Lina Feb. 4, 2025, 7:01 p.m. UTC | #14
On 2/5/25 3:39 AM, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2025 at 11:33:24AM +0100, David Hildenbrand wrote:
>> On 03.02.25 10:58, Simona Vetter wrote:
>>> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>>>> This series refactors the existing Page wrapper to support borrowing
>>>> `struct page` objects without ownership on the Rust side, and converting
>>>> page references to/from physical memory addresses.
>>>>
>>>> The series overlaps with the earlier submission in [1] and follows a
>>>> different approach, based on the discussion that happened there.
>>>>
>>>> The primary use case for this is implementing IOMMU-style page table
>>>> management in Rust. This allows drivers for IOMMUs and MMU-containing
>>>> SoC devices to be written in Rust (such as embedded GPUs). The intended
>>>> logic is similar to how ARM SMMU page tables are managed in the
>>>> drivers/iommu tree.
>>>>
>>>> First, introduce a concept of Owned<T> and an Ownable trait. These are
>>>> similar to ARef<T> and AlwaysRefCounted, but are used for types which
>>>> are not ref counted but rather have a single intended owner.
>>>>
>>>> Then, refactor the existing Page support to use the new mechanism. Pages
>>>> returned from the page allocator are not intended to be ref counted by
>>>> consumers (see previous discussion in [1]), so this keeps Rust's view of
>>>> page ownership as a simple "owned or not". Of course, this is still
>>>> composable as Arc<Owned<Page>> if Rust code needs to reference count its
>>>> own Page allocations for whatever reason.
>>>
>>> I think there's a bit of a potential mess here because the conversion to
>>> folios isn't far enough yet that we can entirely ignore page refcounts and
>>> just use folio refcounts. But I guess we can deal with that oddity if we
>>> hit it (maybe folio conversion moves fast enough), since this only really
>>> starts to become relevant for hmm/svm gpu stuff.
>>
>> I'll note that in the future only selected things will be folios (e.g.,
>> pagecache, anonymous memory). Everything else will either get a separate
>> memdesc (e.g., ptdesc), or work on bare pages.
>>
>> Likely, when talking about page tables, "ptdesc" might be what you want to
>> allocate here, and not "folios".
> 
> I just posted a series to clean up the iommu code that this is
> cribbing the interface from: add an ioptdesc, remove all the struct
> page from the API and so forth:
> 
> https://lore.kernel.org/all/0-v1-416f64558c7c+2a5-iommu_pages_jgg@nvidia.com/
> 
> I strongly suggest that rust not expose these old schemes that will
> need another cleanup like the above. Page tables just need an
> allocator using simple void *, kmalloc is good enough for simple
> cases.
> 
> Drivers should not be exposing or touching struct page just to
> implement a page table.

iommu-pages.h looks like a private API internal to drivers/iommu. Is
there a plan to move that out so it can be used by non-iommu drivers?

If you think I should just use kmalloc, then I guess I could literally
just use KBox<PageTable> where PageTable is just a wrapper around [u64;
PTES_PER_PAGE] with an alignment attribute, or something like that. I
guess then just virt_to_phys() and back for the PTEs? So we still need
to add at least trivial wrappers to export those for Rust drivers to use.
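A rough sketch of that idea in plain std Rust, with `Box` standing in for `KBox` and assuming 4 KiB pages with 8-byte PTEs (so `PTES_PER_PAGE` is 512); none of these names are the actual kernel API:

```rust
const PAGE_SIZE: usize = 4096;
const PTES_PER_PAGE: usize = PAGE_SIZE / core::mem::size_of::<u64>(); // 512

/// One page-sized, page-aligned table of raw PTEs.
#[repr(C, align(4096))]
struct PageTable {
    ptes: [u64; PTES_PER_PAGE],
}

fn main() {
    // `Box` stands in for `KBox` here.
    let pt = Box::new(PageTable { ptes: [0u64; PTES_PER_PAGE] });
    assert_eq!(core::mem::size_of::<PageTable>(), PAGE_SIZE);
    assert_eq!(core::mem::align_of::<PageTable>(), PAGE_SIZE);
    // The alignment attribute guarantees a page-aligned allocation, which
    // is what virt_to_phys()-style conversions would rely on.
    assert_eq!((&*pt as *const PageTable as usize) % PAGE_SIZE, 0);
    assert_eq!(pt.ptes[0], 0);
}
```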

I still need to be able to grab pointers to leaf pages for the crashdump
page table walker code though, and I want to keep the pfn and memory
type checks to make sure it won't crash trying to dereference the page
contents. So we also still need those too.

I'm not entirely sure what a higher-level Rust abstraction for this
looks like at this point, or whether it makes sense at all instead of
just trivially wrapping the C helpers... I don't particularly mind just
writing this part of the driver code "like C" (it's pretty much the
lowest-level bit of code in the driver anyway) but I guess we have to do
some thinking here.

Maybe this is one of those cases where we just do it ad-hoc for now, and
wait until a second driver that needs to implement page tables comes
around and we figure out what can be shared then, and what the API
should look like.

Hmm, starting to think maybe these should literally be impls on the
address types (PhysicalAddr::to_pfn(), PhysicalAddr::to_virt(), etc.).
That won't work without newtyping everything, but maybe we should?
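As a hedged sketch of what those newtyped address impls could look like (plain Rust, illustrative names, `PAGE_SHIFT` assumed to be 12):

```rust
const PAGE_SHIFT: u32 = 12;

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct PhysicalAddr(u64);

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct Pfn(u64);

impl PhysicalAddr {
    fn to_pfn(self) -> Pfn {
        Pfn(self.0 >> PAGE_SHIFT)
    }

    fn from_pfn(pfn: Pfn) -> Self {
        PhysicalAddr(pfn.0 << PAGE_SHIFT)
    }

    fn page_offset(self) -> u64 {
        self.0 & ((1u64 << PAGE_SHIFT) - 1)
    }
}

fn main() {
    let pa = PhysicalAddr(0x1234_5678);
    assert_eq!(pa.to_pfn(), Pfn(0x1_2345));
    // Round-tripping through the pfn drops the in-page offset.
    assert_eq!(PhysicalAddr::from_pfn(pa.to_pfn()).0 + pa.page_offset(), pa.0);
}
```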

~~ Lina
David Hildenbrand Feb. 4, 2025, 8:05 p.m. UTC | #15
On 04.02.25 19:39, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2025 at 11:33:24AM +0100, David Hildenbrand wrote:
>> On 03.02.25 10:58, Simona Vetter wrote:
>>> On Sun, Feb 02, 2025 at 10:05:42PM +0900, Asahi Lina wrote:
>>>> This series refactors the existing Page wrapper to support borrowing
>>>> `struct page` objects without ownership on the Rust side, and converting
>>>> page references to/from physical memory addresses.
>>>>
>>>> The series overlaps with the earlier submission in [1] and follows a
>>>> different approach, based on the discussion that happened there.
>>>>
>>>> The primary use case for this is implementing IOMMU-style page table
>>>> management in Rust. This allows drivers for IOMMUs and MMU-containing
>>>> SoC devices to be written in Rust (such as embedded GPUs). The intended
>>>> logic is similar to how ARM SMMU page tables are managed in the
>>>> drivers/iommu tree.
>>>>
>>>> First, introduce a concept of Owned<T> and an Ownable trait. These are
>>>> similar to ARef<T> and AlwaysRefCounted, but are used for types which
>>>> are not ref counted but rather have a single intended owner.
>>>>
>>>> Then, refactor the existing Page support to use the new mechanism. Pages
>>>> returned from the page allocator are not intended to be ref counted by
>>>> consumers (see previous discussion in [1]), so this keeps Rust's view of
>>>> page ownership as a simple "owned or not". Of course, this is still
>>>> composable as Arc<Owned<Page>> if Rust code needs to reference count its
>>>> own Page allocations for whatever reason.
>>>
>>> I think there's a bit of a potential mess here because the conversion to
>>> folios isn't far enough yet that we can entirely ignore page refcounts and
>>> just use folio refcounts. But I guess we can deal with that oddity if we
>>> hit it (maybe folio conversion moves fast enough), since this only really
>>> starts to become relevant for hmm/svm gpu stuff.
>>
>> I'll note that in the future only selected things will be folios (e.g.,
>> pagecache, anonymous memory). Everything else will either get a separate
>> memdesc (e.g., ptdesc), or work on bare pages.
>>
>> Likely, when talking about page tables, "ptdesc" might be what you want to
>> allocate here, and not "folios".
> 
> I just posted a series to clean up the iommu code that this is
> cribbing the interface from: add an ioptdesc, remove all the struct
> page from the API and so forth:
> 
> https://lore.kernel.org/all/0-v1-416f64558c7c+2a5-iommu_pages_jgg@nvidia.com/
> 
> I strongly suggest that rust not expose these old schemes that will
> need another cleanup like the above. Page tables just need an
> allocator using simple void *, kmalloc is good enough for simple
> cases.
> 
> Drivers should not be exposing or touching struct page just to
> implement a page table.

Fully agreed, this is going in the right direction. Dumping what's 
mapped is a different story. Maybe that dumping logic could simply be 
written in C for the time being?
David Hildenbrand Feb. 4, 2025, 8:10 p.m. UTC | #16
On 04.02.25 18:59, Asahi Lina wrote:
> On 2/4/25 11:38 PM, David Hildenbrand wrote:
>>>>> If the answer is "no" then that's fine. It's still an unsafe function
>>>>> and we need to document in the safety section that it should only be
>>>>> used for memory that is either known to be allocated and pinned and
>>>>> will
>>>>> not be freed while the `struct page` is borrowed, or memory that is
>>>>> reserved and not owned by the buddy allocator, so in practice correct
>>>>> use would not be racy with memory hot-remove anyway.
>>>>>
>>>>> This is already the case for the drm/asahi use case, where the pfns
>>>>> looked up will only ever be one of:
>>>>>
>>>>> - GEM objects that are mapped to the GPU and whose physical pages are
>>>>> therefore pinned (and the VM is locked while this happens so the
>>>>> objects
>>>>> cannot become unpinned out from under the running code),
>>>>
>>>> How exactly are these pages pinned/obtained?
>>>
>>> Under the hood it's shmem. For pinning, it winds up at
>>> `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
>>> a mapping set as unevictable.
>>
>> Thanks. So we grab another folio reference via shmem_read_folio_gfp()->
>> shmem_get_folio_gfp().
>>
>> Hm, I wonder if we might end up holding folios residing in ZONE_MOVABLE/
>> MIGRATE_CMA longer than we should.
>>
>> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes
>> sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
>>
>> But that's a different discussion, just pointing it out, maybe I'm
>> missing something :)
> 
> I think this is a little over my head. Though I only just realized that
> we seem to be keeping the GEM objects pinned forever, even after unmap,
> in the drm-shmem core API (I see no drm-shmem entry point that would
> allow the sgt to be freed and its corresponding pages ref to be dropped,
> other than a purge of purgeable objects or final destruction of the
> object). I'll poke around since this feels wrong, I thought we were
> supposed to be able to have shrinker support for swapping out whole GPU
> VMs in the modern GPU MM model, but I guess there's no implementation of
> that for gem-shmem drivers yet...?

I recall that shrinker as well, ... or at least a discussion around it.

[...]

>>
>> If it's only for crash dumps etc. that might even be opt-in, it makes
>> the whole thing a lot less scary. Maybe this could be opt-in somewhere,
>> to "unlock" this interface? Just an idea.
> 
> Just to make sure we're on the same page, I don't think there's anything
> to unlock in the Rust abstraction side (this series). At the end of the
> day, if nothing else, the unchecked interface (which the regular
> non-crash page table management code uses for performance) will let you
> use any pfn you want, it's up to documentation and human review to
> specify how it should be used by drivers. What Rust gives us here is the
> mandatory `unsafe {}`, so any attempts to use this API will necessarily
> stick out during review as potentially dangerous code that needs extra
> scrutiny.
> 
> For the client driver itself, I could gate the devcoredump stuff behind
> a module parameter or something... but I don't think it's really worth
> it. We don't have a way to reboot the firmware or recover from this
> condition (platform limitations), so end users are stuck rebooting to
> get back a usable machine anyway. If something goes wrong in the
> crashdump code and the machine oopses or locks up worse... it doesn't
> really make much of a difference for normal end users. I don't think
> this will ever really happen given the constraints I described, but if
> somehow it does (some other bug somewhere?), well... the machine was
> already in an unrecoverable state anyway.
> 
> It would be nice to have userspace tooling deployed by default that
> saves off the devcoredump somewhere, so we can have a chance at
> debugging hard-to-hit firmware crashes... if it's opt-in, it would only
> really be useful for developers and CI machines.

Is this something that possibly kdump can save or analyze? Because that 
is our default "oops, kernel crashed, let's save the old content so we 
can dump it" mechanism on production systems.

but ... I am not familiar with devcoredump. So I don't know when/how it 
runs, and if the source system is still alive (and remains alive --  in 
contrast to a kernel crash).
Jason Gunthorpe Feb. 4, 2025, 8:26 p.m. UTC | #17
On Tue, Feb 04, 2025 at 09:05:47PM +0100, David Hildenbrand wrote:
> Fully agreed, this is going into the right direction. Dumping what's mapped
> is a different story. Maybe that dumping logic could simply be written in C
> for the time being?

?

Isn't dumping just a
  decode pte -> phys_to_virt() -> for_each_u64(virt) -> printk?

If you can populate the page table in rust you have to be able to dump
it, surely.
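As a toy model of that decode-and-walk loop (the PTE layout and values are invented for illustration, with a `Vec<u64>` standing in for one page of direct-mapped memory):

```rust
const PTE_VALID: u64 = 1;
const PTE_ADDR_MASK: u64 = !0xfff;

/// Decode an invented PTE format: bit 0 = valid, bits 12+ = phys address.
fn decode_pte(pte: u64) -> Option<u64> {
    if pte & PTE_VALID != 0 {
        Some(pte & PTE_ADDR_MASK)
    } else {
        None
    }
}

fn main() {
    // One fake 4 KiB page of "physical memory" (512 u64s) at phys 0x2000.
    let page: Vec<u64> = (0..512u64).map(|i| i * 3).collect();
    let pte = 0x2000u64 | PTE_VALID;

    // decode pte -> "phys_to_virt" (here: just our Vec) -> walk each u64.
    let phys = decode_pte(pte).expect("valid pte");
    assert_eq!(phys, 0x2000);
    let sum: u64 = page.iter().sum();
    assert_eq!(sum, 3 * (511 * 512 / 2));
}
```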

> Or do you mean trying to decode the leaf entries into some struct page
detail? Does a GPU need to do that? Agree that would be better as some
C function to take in a KVA instead of a struct page and populate a
string with details about that KVA from the struct page.

Jason
David Hildenbrand Feb. 4, 2025, 8:41 p.m. UTC | #18
On 04.02.25 21:26, Jason Gunthorpe wrote:
> On Tue, Feb 04, 2025 at 09:05:47PM +0100, David Hildenbrand wrote:
>> Fully agreed, this is going into the right direction. Dumping what's mapped
>> is a different story. Maybe that dumping logic could simply be written in C
>> for the time being?
> 
> ?
> 
> Isn't dumping just a
>    decode pte -> phys_to_virt() -> for_each_u64(virt) -> printk?
> 

IIUC, the problematic bit is that you might not have a directmap such 
that phys_to_virt() would tell you the whole story.

> If you can populate the page table in rust you have to be able to dump
> it, surely.
> 
>> Or do you mean trying to decode the leaf entries into some struct page
> detail? Does a GPU need to do that? Agree that would be better as some
> C function to take in a KVA instead of a struct page and populate a
> string with details about that KVA from the struct page.
> 
> Jason
>
David Hildenbrand Feb. 4, 2025, 8:47 p.m. UTC | #19
On 04.02.25 21:41, David Hildenbrand wrote:
> On 04.02.25 21:26, Jason Gunthorpe wrote:
>> On Tue, Feb 04, 2025 at 09:05:47PM +0100, David Hildenbrand wrote:
>>> Fully agreed, this is going into the right direction. Dumping what's mapped
>>> is a different story. Maybe that dumping logic could simply be written in C
>>> for the time being?
>>
>> ?
>>
>> Isn't dumping just a
>>     decode pte -> phys_to_virt() -> for_each_u64(virt) -> printk?
>>
> 
> IIUC, the problematic bit is that you might not have a directmap such
> that phys_to_virt() would tell you the whole story.

... but it's late and I am confused. For dumping the *page table* that 
would not be required, only when dumping mapped page content (and at 
this point I am not sure if that is a requirement).

So hopefully Asahi Lina can clarify what the issue was (if there is any 
:) ).
Jason Gunthorpe Feb. 4, 2025, 8:49 p.m. UTC | #20
On Tue, Feb 04, 2025 at 09:41:29PM +0100, David Hildenbrand wrote:
> On 04.02.25 21:26, Jason Gunthorpe wrote:
> > On Tue, Feb 04, 2025 at 09:05:47PM +0100, David Hildenbrand wrote:
> > > Fully agreed, this is going into the right direction. Dumping what's mapped
> > > is a different story. Maybe that dumping logic could simply be written in C
> > > for the time being?
> > 
> > ?
> > 
> > Isn't dumping just a
> >    decode pte -> phys_to_virt() -> for_each_u64(virt) -> printk?
> > 
> 
> IIUC, the problematic bit is that you might not have a directmap such that
> phys_to_virt() would tell you the whole story.

The phys_to_virt() I mean is on the page table pages themselves, not
on a leaf.

All page table pages came from alloc_pages_node() and you'd do
virt_to_phys() to stick them into a next-table PTE, then
phys_to_virt() to bring them back:

  phys_to_virt(virt_to_phys(page_address(alloc_pages_node())))

Effectively.
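That round trip can be modeled with a toy linear direct map (the offset is an arbitrary illustrative constant, not a real kernel one):

```rust
// Arbitrary illustrative direct-map offset.
const PAGE_OFFSET: u64 = 0xffff_8000_0000_0000;

fn virt_to_phys(va: u64) -> u64 {
    va - PAGE_OFFSET
}

fn phys_to_virt(pa: u64) -> u64 {
    pa + PAGE_OFFSET
}

fn main() {
    // Pretend this is the KVA of a freshly allocated page-table page.
    let va = PAGE_OFFSET + 0x0123_4000;
    // Store its physical address in a next-level PTE...
    let pte_phys = virt_to_phys(va);
    // ...and later bring it back to walk the table: a pure round trip.
    assert_eq!(phys_to_virt(pte_phys), va);
}
```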

The leaf pages could be anything and you should never phys_to_virt()
those; this is what I mean by:

> > Or do you mean trying to decode the leaf entries into some struct page
> > detail? Does a GPU need to do that? Agree that would be better as some
> > C function to take in a KVA instead of a struct page and populate a
> > string with details about that KVA from the struct page.

Though I should have said phys_addr_t not KVA

Jason
Asahi Lina Feb. 4, 2025, 9:06 p.m. UTC | #21
On 2/5/25 5:10 AM, David Hildenbrand wrote:
> On 04.02.25 18:59, Asahi Lina wrote:
>> On 2/4/25 11:38 PM, David Hildenbrand wrote:
>>>>>> If the answer is "no" then that's fine. It's still an unsafe function
>>>>>> and we need to document in the safety section that it should only be
>>>>>> used for memory that is either known to be allocated and pinned and
>>>>>> will
>>>>>> not be freed while the `struct page` is borrowed, or memory that is
>>>>>> reserved and not owned by the buddy allocator, so in practice correct
>>>>>> use would not be racy with memory hot-remove anyway.
>>>>>>
>>>>>> This is already the case for the drm/asahi use case, where the pfns
>>>>>> looked up will only ever be one of:
>>>>>>
>>>>>> - GEM objects that are mapped to the GPU and whose physical pages are
>>>>>> therefore pinned (and the VM is locked while this happens so the
>>>>>> objects
>>>>>> cannot become unpinned out from under the running code),
>>>>>
>>>>> How exactly are these pages pinned/obtained?
>>>>
>>>> Under the hood it's shmem. For pinning, it winds up at
>>>> `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
>>>> a mapping set as unevictable.
>>>
>>> Thanks. So we grab another folio reference via shmem_read_folio_gfp()->
>>> shmem_get_folio_gfp().
>>>
>>> Hm, I wonder if we might end up holding folios residing in ZONE_MOVABLE/
>>> MIGRATE_CMA longer than we should.
>>>
>>> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes
>>> sure to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
>>>
>>> But that's a different discussion, just pointing it out, maybe I'm
>>> missing something :)
>>
>> I think this is a little over my head. Though I only just realized that
>> we seem to be keeping the GEM objects pinned forever, even after unmap,
>> in the drm-shmem core API (I see no drm-shmem entry point that would
>> allow the sgt to be freed and its corresponding pages ref to be dropped,
>> other than a purge of purgeable objects or final destruction of the
>> object). I'll poke around since this feels wrong, I thought we were
>> supposed to be able to have shrinker support for swapping out whole GPU
>> VMs in the modern GPU MM model, but I guess there's no implementation of
>> that for gem-shmem drivers yet...?
> 
> I recall that shrinker as well, ... or at least a discussion around it.
> 
> [...]
> 
>>>
>>> If it's only for crash dumps etc. that might even be opt-in, it makes
>>> the whole thing a lot less scary. Maybe this could be opt-in somewhere,
>>> to "unlock" this interface? Just an idea.
>>
>> Just to make sure we're on the same page, I don't think there's anything
>> to unlock in the Rust abstraction side (this series). At the end of the
>> day, if nothing else, the unchecked interface (which the regular
>> non-crash page table management code uses for performance) will let you
>> use any pfn you want, it's up to documentation and human review to
>> specify how it should be used by drivers. What Rust gives us here is the
>> mandatory `unsafe {}`, so any attempts to use this API will necessarily
>> stick out during review as potentially dangerous code that needs extra
>> scrutiny.
>>
>> For the client driver itself, I could gate the devcoredump stuff behind
>> a module parameter or something... but I don't think it's really worth
>> it. We don't have a way to reboot the firmware or recover from this
>> condition (platform limitations), so end users are stuck rebooting to
>> get back a usable machine anyway. If something goes wrong in the
>> crashdump code and the machine oopses or locks up worse... it doesn't
>> really make much of a difference for normal end users. I don't think
>> this will ever really happen given the constraints I described, but if
>> somehow it does (some other bug somewhere?), well... the machine was
>> already in an unrecoverable state anyway.
>>
>> It would be nice to have userspace tooling deployed by default that
>> saves off the devcoredump somewhere, so we can have a chance at
>> debugging hard-to-hit firmware crashes... if it's opt-in, it would only
>> really be useful for developers and CI machines.
> 
> Is this something that possibly kdump can save or analyze? Because that
> is our default "oops, kernel crashed, let's save the old content so we
> can dump it" mechanism on production systems.

kdump does not work on Apple ARM systems because kexec is broken and
cannot be fully fixed, due to multiple platform/firmware limitations. A
very limited version of kexec might work well enough for kdump, but I
don't think anyone has looked into making that work yet...

> but ... I am not familiar with devcoredump. So I don't know when/how it
> runs, and if the source system is still alive (and remains alive --  in
> contrast to a kernel crash).

Devcoredump just makes the dump available via /sys so it can be
collected by the user. The system is still alive, the GPU is just dead
and all future GPU job submissions fail. You can still SSH in or (at
least in theory, if enough moving parts are graceful about it) VT-switch
to a TTY. The display controller is not part of the GPU, it is separate
hardware.

~~ Lina
Asahi Lina Feb. 4, 2025, 9:18 p.m. UTC | #22
On 2/5/25 5:47 AM, David Hildenbrand wrote:
> On 04.02.25 21:41, David Hildenbrand wrote:
>> On 04.02.25 21:26, Jason Gunthorpe wrote:
>>> On Tue, Feb 04, 2025 at 09:05:47PM +0100, David Hildenbrand wrote:
>>>> Fully agreed, this is going into the right direction. Dumping what's
>>>> mapped
>>>> is a different story. Maybe that dumping logic could simply be
>>>> written in C
>>>> for the time being?
>>>
>>> ?
>>>
>>> Isn't dumping just a
>>>     decode pte -> phys_to_virt() -> for_each_u64(virt) -> printk?
>>>
>>
>> IIUC, the problematic bit is that you might not have a directmap such
>> that phys_to_virt() would tell you the whole story.
> 
> ... but it's late and I am confused. For dumping the *page table* that
> would not be required, only when dumping mapped page content (and at
> this point I am not sure if that is a requirement).
> 
> So hopefully Asahi Lina can clarify what the issue was (if there is
> any :) ).

Yes, the crash dumper has to dump the mapped page content. In fact, I
don't care about the page tables themselves other than PTE permission
bits, and the page tables alone are not useful for debugging firmware
crashes (and aren't even included in the dump verbatim, at least not the
kernel-allocated ones). The goal of the crash dumper is to take a
complete dump of firmware virtual memory address space (which includes
the kinds of memory I mentioned in [1]). The output is an ELF format
core dump with all memory that the GPU firmware can access, that the
kernel was able to successfully dump (which should be everything except
for MMIO if the bootloader/DT have the right reserved-memory setup).

I *think* phys_to_virt should work just fine for *this* specific use
case on this platform, but I'm not entirely sure. I still want to use
the various pfn check functions before doing that, to exclude ranges
that would definitely not work.

I don't really want to write any part of the driver in C, but whatever
logic is required, I don't think there should be a good reason for that.
We can export the C functions required to Rust code in some form, we
just need to decide what form that should be. Worst case, Rust drivers
can directly access C functions via the `bindings` crate, though that's
something I want to avoid (even if there's only a very thin or trivial
Rust abstraction layer, there should be *something* so at least we get
Rust docs and the ability to make subtle improvements to the
interface/types/etc).
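(To sketch what even a "very thin" abstraction layer buys: the pattern is
roughly the one below. `raw_phys_to_virt` and the linear-map base are
fakes made up for this example so it's self-contained; real code would
call into the `bindings` crate instead, so only the wrapper shape is
meaningful here.)

```rust
// Sketch of the "thin Rust wrapper over a raw C binding" pattern.
// `raw_phys_to_virt` is a hypothetical stand-in for a real binding;
// the fake direct-map base is made up so the example runs in
// userspace.
type PhysicalAddr = usize;

const FAKE_DIRECT_MAP_BASE: usize = 0xffff_8000_0000_0000;

/// Hypothetical stand-in for the raw C binding.
unsafe fn raw_phys_to_virt(phys: PhysicalAddr) -> *mut u8 {
    (FAKE_DIRECT_MAP_BASE + phys) as *mut u8
}

/// Translate a physical address to its direct-map virtual address.
///
/// Even a trivial wrapper adds rustdoc, a dedicated `PhysicalAddr`
/// type, and a single place to state the safety contract.
///
/// # Safety
///
/// `phys` must lie within memory covered by the direct map.
unsafe fn phys_to_virt(phys: PhysicalAddr) -> *mut u8 {
    unsafe { raw_phys_to_virt(phys) }
}

fn main() {
    let virt = unsafe { phys_to_virt(0x1000) };
    assert_eq!(virt as usize, FAKE_DIRECT_MAP_BASE + 0x1000);
}
```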

[1]
https://lore.kernel.org/rust-for-linux/1e9ae833-4293-4e48-83b2-c0af36cb3fdc@asahilina.net/

~~ Lina
Simona Vetter Feb. 5, 2025, 7:40 a.m. UTC | #23
On Tue, Feb 04, 2025 at 03:38:17PM +0100, David Hildenbrand wrote:
> > > It can still race with memory offlining, and it refuses ZONE_DEVICE
> > > pages. For the latter, we have a different way to check validity. See
> > > memory_failure() that first calls pfn_to_online_page() to then check
> > > get_dev_pagemap().
> > 
> > I'll give it a shot with these functions. If they work for my use case,
> > then it's good to have extra checks and I'll add them for v2. Thanks!
> 
> Let me know if you run into any issues.
> 
> > 
> > > 
> > > > 
> > > > If the answer is "no" then that's fine. It's still an unsafe function
> > > > and we need to document in the safety section that it should only be
> > > > used for memory that is either known to be allocated and pinned and will
> > > > not be freed while the `struct page` is borrowed, or memory that is
> > > > reserved and not owned by the buddy allocator, so in practice correct
> > > > use would not be racy with memory hot-remove anyway.
> > > > 
> > > > This is already the case for the drm/asahi use case, where the pfns
> > > > looked up will only ever be one of:
> > > > 
> > > > - GEM objects that are mapped to the GPU and whose physical pages are
> > > > therefore pinned (and the VM is locked while this happens so the objects
> > > > cannot become unpinned out from under the running code),
> > > 
> > > How exactly are these pages pinned/obtained?
> > 
> > Under the hood it's shmem. For pinning, it winds up at
> > `drm_gem_get_pages()`, which I think does a `shmem_read_folio_gfp()` on
> > a mapping set as unevictable.
> 
> Thanks. So we grab another folio reference via
> shmem_read_folio_gfp()->shmem_get_folio_gfp().
> 
> Hm, I wonder if we might end up holding folios residing in
> ZONE_MOVABLE/MIGRATE_CMA longer than we should.
> 
> Compared to memfd_pin_folios(), which simulates FOLL_LONGTERM and makes sure
> to migrate pages out of ZONE_MOVABLE/MIGRATE_CMA.
> 
> But that's a different discussion, just pointing it out, maybe I'm missing
> something :)

Good GPU Drivers (tm) are supposed to have a shrinker so we can at least
nuke some of them again. Some folks even looked into hooking up a migrate
callback through the address_space (or wherever that hook was, this is
from memory) so we can make this somewhat reliable. So yeah we're hogging
ZONE_MOVABLE unduly still.

The other side is that there are only about 2-3 good drivers (msm and
i915; xe should have a shrinker now too, but I didn't check). The others
all fall short to various degrees, or still have thrice-cargo-culted
versions of i915's pin-as-a-lock design and get it completely wrong.

So yeah I'm aware this isn't great, and we're at least glacially slowly
moving towards a common shrinker infrastructure that maybe in a glorious
future gets all this right. I mean it took us 15+ years to get to a
cgroups controller after all too, and that was also a well-known issue
of just being able to hog memory with no controls and potentially cause
havoc.

Cheers, Sima