
arm64/mm: don't WARN when alloc/free-ing device private pages

Message ID 20230406040515.383238-1-jhubbard@nvidia.com (mailing list archive)
State New

Commit Message

John Hubbard April 6, 2023, 4:05 a.m. UTC
Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
functionality was first developed on x86, it also works on arm64.
However, when trying this out on an arm64 system, it turns out that
there is a massive slowdown during the setup and teardown phases.

This slowdown is due to lots of calls to WARN_ON()s that are checking
for pages that are out of the physical range for the CPU. However,
that's a design feature of device private pages: they are specifically
chosen in order to be outside of the range of the CPU's true physical
pages.

x86 doesn't have this warning. It only checks that pages are properly
aligned. I've shown a comparison below between x86 (which works well)
and arm64 (which has these warnings).

memunmap_pages()
  pageunmap_range()
    if (pgmap->type == MEMORY_DEVICE_PRIVATE)
      __remove_pages()
        __remove_section()
          sparse_remove_section()
            section_deactivate()
              depopulate_section_memmap()
                /* arch/arm64/mm/mmu.c */
                vmemmap_free()
                {
                  WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
                  ...
                }

                /* arch/x86/mm/init_64.c */
                vmemmap_free()
                {
                  VM_BUG_ON(!PAGE_ALIGNED(start));
                  VM_BUG_ON(!PAGE_ALIGNED(end));
                  ...
                }

So, the warning is a false positive for this case. Therefore, skip the
warning if CONFIG_DEVICE_PRIVATE is set.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
cc: <stable@vger.kernel.org>
---
 arch/arm64/mm/mmu.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Ard Biesheuvel April 6, 2023, 7:31 a.m. UTC | #1
Hello John,

On Thu, 6 Apr 2023 at 06:05, John Hubbard <jhubbard@nvidia.com> wrote:
>
> Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
> functionality was first developed on x86, it also works on arm64.
> However, when trying this out on an arm64 system, it turns out that
> there is a massive slowdown during the setup and teardown phases.
>
> This slowdown is due to lots of calls to WARN_ON()s that are checking
> for pages that are out of the physical range for the CPU. However,
> that's a design feature of device private pages: they are specifically
> chosen in order to be outside of the range of the CPU's true physical
> pages.
>

Currently, the vmemmap region is dimensioned to only cover the PFN
range that backs the linear map. So the WARN() seems appropriate here:
you are mapping struct page[] ranges outside of the allocated window,
and afaict, you might actually wrap around and corrupt the linear map
at the start of the kernel VA space like this.


> x86 doesn't have this warning. It only checks that pages are properly
> aligned. I've shown a comparison below between x86 (which works well)
> and arm64 (which has these warnings).
>
> memunmap_pages()
>   pageunmap_range()
>     if (pgmap->type == MEMORY_DEVICE_PRIVATE)
>       __remove_pages()
>         __remove_section()
>           sparse_remove_section()
>             section_deactivate()
>               depopulate_section_memmap()
>                 /* arch/arm64/mm/mmu.c */
>                 vmemmap_free()
>                 {
>                   WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>                   ...
>                 }
>
>                 /* arch/x86/mm/init_64.c */
>                 vmemmap_free()
>                 {
>                   VM_BUG_ON(!PAGE_ALIGNED(start));
>                   VM_BUG_ON(!PAGE_ALIGNED(end));
>                   ...
>                 }
>
> So, the warning is a false positive for this case. Therefore, skip the
> warning if CONFIG_DEVICE_PRIVATE is set.
>

I don't think this is a false positive. We'll need to adjust
VMEMMAP_SIZE to account for this.


> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> cc: <stable@vger.kernel.org>
> ---
>  arch/arm64/mm/mmu.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 6f9d8898a025..d5c9b611a8d1 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1157,8 +1157,10 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>                 struct vmem_altmap *altmap)
>  {
> +/* Device private pages are outside of the CPU's physical page range. */
> +#ifndef CONFIG_DEVICE_PRIVATE
>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> -
> +#endif
>         if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>                 return vmemmap_populate_basepages(start, end, node, altmap);
>         else
> @@ -1169,8 +1171,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>                 struct vmem_altmap *altmap)
>  {
> +/* Device private pages are outside of the CPU's physical page range. */
> +#ifndef CONFIG_DEVICE_PRIVATE
>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> -
> +#endif
>         unmap_hotplug_range(start, end, true, altmap);
>         free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
>  }
> --
> 2.40.0
>
Andrew Morton April 6, 2023, 8:07 p.m. UTC | #2
On Wed, 5 Apr 2023 21:05:15 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
> functionality was first developed on x86, it also works on arm64.
> However, when trying this out on an arm64 system, it turns out that
> there is a massive slowdown during the setup and teardown phases.
> 
> This slowdown is due to lots of calls to WARN_ON()s that are checking
> for pages that are out of the physical range for the CPU. However,
> that's a design feature of device private pages: they are specifically
> chosen in order to be outside of the range of the CPU's true physical
> pages.
>
> ...
>
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1157,8 +1157,10 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		struct vmem_altmap *altmap)
>  {
> +/* Device private pages are outside of the CPU's physical page range. */
> +#ifndef CONFIG_DEVICE_PRIVATE
>  	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));

For a simple expression like this to cause a "massive slowdown", I
assume the WARN is triggering.  But changelog doesn't mention massive
dmesg spewage?

Given Ard's comments, perhaps a switch to WARN_ON_ONCE() would suit?
John Hubbard April 6, 2023, 8:18 p.m. UTC | #3
On 4/6/23 13:07, Andrew Morton wrote:
> On Wed, 5 Apr 2023 21:05:15 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> 
>> Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
>> functionality was first developed on x86, it also works on arm64.
>> However, when trying this out on an arm64 system, it turns out that
>> there is a massive slowdown during the setup and teardown phases.
>>
>> This slowdown is due to lots of calls to WARN_ON()s that are checking
>> for pages that are out of the physical range for the CPU. However,
>> that's a design feature of device private pages: they are specifically
>> chosen in order to be outside of the range of the CPU's true physical
>> pages.
>>
>> ...
>>
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1157,8 +1157,10 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>  		struct vmem_altmap *altmap)
>>  {
>> +/* Device private pages are outside of the CPU's physical page range. */
>> +#ifndef CONFIG_DEVICE_PRIVATE
>>  	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> 
> For a simple expression like this to cause a "massive slowdown", I
> assume the WARN is triggering.  But changelog doesn't mention massive
> dmesg spewage?

Well, it should. Whoever wrote that needs to improve the changelog. :)

> 
> Given Ard's comments, perhaps a switch to WARN_ON_ONCE() would suit?

That would fix up the user-visible problems, which would be very nice.

Meanwhile, I'm trying to sort out whether this really is a false 
positive for arm64.

thanks,
John Hubbard April 7, 2023, 12:13 a.m. UTC | #4
On 4/6/23 00:31, Ard Biesheuvel wrote:
> Hello John,
> 
> On Thu, 6 Apr 2023 at 06:05, John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
>> functionality was first developed on x86, it also works on arm64.
>> However, when trying this out on an arm64 system, it turns out that
>> there is a massive slowdown during the setup and teardown phases.
>>
>> This slowdown is due to lots of calls to WARN_ON()s that are checking
>> for pages that are out of the physical range for the CPU. However,
>> that's a design feature of device private pages: they are specifically
>> chosen in order to be outside of the range of the CPU's true physical
>> pages.
>>

Hi Ard,

Thank you for looking at this so soon, I've been working on filling
in some details and studying what you said.

By the way, to address an implicit question from Andrew on the other
thread, the reason that this slows things down is a loop in
__add_pages() that repeatedly calls through to vmemmap_populate(),
like this:

device driver setup: allocate struct pages for the device (GPU)
    memremap_pages(pagemap.type = MEMORY_DEVICE_PRIVATE)
        pagemap_range()
        __add_pages() /* device private case does this */
            for (; pfn < end_pfn; pfn += cur_nr_pages) {
		/* this loops 125 times on an x86 test machine: */
                sparse_add_section()
                    section_activate()
                        populate_section_memmap()
                            __populate_section_memmap()
                                vmemmap_populate()

> 
> Currently, the vmemmap region is dimensioned to only cover the PFN
> range that backs the linear map. So the WARN() seems appropriate here:
> you are mapping struct page[] ranges outside of the allocated window,
> and afaict, you might actually wrap around and corrupt the linear map
> at the start of the kernel VA space like this.
> 

Well...it's only doing a partial hotplug of these pages, just enough to get
ZONE_DEVICE to work. As I understand it so far, only the struct pages
themselves are ever accessed, not any mappings.

More below:

...
>>                 /* arch/x86/mm/init_64.c */
>>                 vmemmap_free()
>>                 {
>>                   VM_BUG_ON(!PAGE_ALIGNED(start));
>>                   VM_BUG_ON(!PAGE_ALIGNED(end));
>>                   ...
>>                 }
>>
>> So, the warning is a false positive for this case. Therefore, skip the
>> warning if CONFIG_DEVICE_PRIVATE is set.
>>
> 
> I don't think this is a false positive. We'll need to adjust
> VMEMMAP_SIZE to account for this.
> 

Looking at the (new to me) arm64 code for this, VMEMMAP_SIZE is only
ever used to calculate VMEMMAP_END, which in turn is used for the
WARN_ON()s in question, plus as the "ceiling" arg to free_empty_tables().

It seems Mostly Harmless. How would you feel about changing it to a
WARN_ON_ONCE() as Andrew suggested? That would completely fix the
user-visible symptoms:

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6f9d8898a025..82d4205af9f2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1157,7 +1157,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
-	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+	WARN_ON_ONCE((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
 	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
 		return vmemmap_populate_basepages(start, end, node, altmap);
@@ -1169,7 +1169,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
-	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+	WARN_ON_ONCE((start < VMEMMAP_START) || (end > VMEMMAP_END));
 
 	unmap_hotplug_range(start, end, true, altmap);
 	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);



thanks,
Ard Biesheuvel April 7, 2023, 10:45 a.m. UTC | #5
On Fri, 7 Apr 2023 at 02:13, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 4/6/23 00:31, Ard Biesheuvel wrote:
> > Hello John,
> >
> > On Thu, 6 Apr 2023 at 06:05, John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> Although CONFIG_DEVICE_PRIVATE and hmm_range_fault() and related
> >> functionality was first developed on x86, it also works on arm64.
> >> However, when trying this out on an arm64 system, it turns out that
> >> there is a massive slowdown during the setup and teardown phases.
> >>
> >> This slowdown is due to lots of calls to WARN_ON()s that are checking
> >> for pages that are out of the physical range for the CPU. However,
> >> that's a design feature of device private pages: they are specifically
> >> chosen in order to be outside of the range of the CPU's true physical
> >> pages.
> >>
>
> Hi Ard,
>
> Thank you for looking at this so soon, I've been working on filling
> in some details and studying what you said.
>
> By the way, to address an implicit question from Andrew on the other
> thread, the reason that this slows things down is due to a loop in
> __add_pages() that repeatedly calls through to vmemmap_populate(),
> like this:
>
> device driver setup: allocate struct pages for the device (GPU)
>     memremap_pages(pagemap.type = MEMORY_DEVICE_PRIVATE)
>         pagemap_range()
>         __add_pages() /* device private case does this */
>             for (; pfn < end_pfn; pfn += cur_nr_pages) {
>                 /* this loops 125 times on an x86 test machine: */
>                 sparse_add_section()
>                     section_activate()
>                         populate_section_memmap()
>                             __populate_section_memmap()
>                                 vmemmap_populate()
>
> >
> > Currently, the vmemmap region is dimensioned to only cover the PFN
> > range that backs the linear map. So the WARN() seems appropriate here:
> > you are mapping struct page[] ranges outside of the allocated window,
> > and afaict, you might actually wrap around and corrupt the linear map
> > at the start of the kernel VA space like this.
> >
>
> Well...it's only doing a partial hotplug of these pages, just enough to get
> ZONE_DEVICE to work. As I understand it so far, only the struct pages
> themselves are ever accessed, not any mappings.
>

That is what I am talking about - the struct pages are allocated in a
region that is reserved for something else.

Maybe an example helps here:

When running the 39-bit VA kernel build on an AMD Seattle board, we will
have (assuming sizeof(struct page) == 64)

memstart_addr := 0x80_0000_0000

PAGE_OFFSET := 0xffff_ff80_0000_0000

VMEMMAP_SHIFT := 6
VMEMMAP_START := 0xffff_fffe_0000_0000
VMEMMAP_SIZE := 0x1_0000_0000

pfn_to_page() conversions are based on ordinary array indexing
involving vmemmap[], where vmemmap is defined as

#define vmemmap \
    ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))

So the PFN associated with the first usable DRAM address is
0x800_0000, and pfn_to_page(0x800_0000) will return VMEMMAP_START.

pfn_to_page(x) for any x < 0x800_0000 will produce a kernel VA that
points into the vmalloc area, and may conflict with the kernel
mapping, modules mappings, per-CPU mappings, IO mappings, etc etc.

pfn_to_page(x) for values 0xc00_0000 < x < 0x1000_0000 will produce a
kernel VA that points outside the region set aside for the vmemmap.
This region is currently unused, but that will likely change soon.

pfn_to_page(x) for any x >= 0x1000_0000 will wrap around and produce a
bogus address in the user range.
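
To make that easy to double-check, here is a small stand-alone program
(a sketch that only uses the example constants above, not kernel code)
which evaluates the same vmemmap[] indexing for the three representative
PFN ranges:

#include <inttypes.h>
#include <stdio.h>

/* Constants from the 39-bit VA / 4K pages / 64-byte struct page example. */
#define VMEMMAP_START   0xfffffffe00000000ULL
#define VMEMMAP_SIZE    0x0000000100000000ULL
#define VMEMMAP_END     (VMEMMAP_START + VMEMMAP_SIZE)
#define MEMSTART_ADDR   0x0000008000000000ULL
#define PAGE_SHIFT      12
#define STRUCT_PAGE_SZ  64ULL

/* vmemmap = (struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT) */
static uint64_t pfn_to_page_va(uint64_t pfn)
{
        uint64_t vmemmap = VMEMMAP_START -
                           (MEMSTART_ADDR >> PAGE_SHIFT) * STRUCT_PAGE_SZ;

        return vmemmap + pfn * STRUCT_PAGE_SZ;  /* wraps modulo 2^64 */
}

int main(void)
{
        uint64_t pfns[] = { 0x8000000, 0xc000000, 0x10000000 };

        for (int i = 0; i < 3; i++) {
                uint64_t va = pfn_to_page_va(pfns[i]);

                printf("pfn 0x%08" PRIx64 " -> page VA 0x%016" PRIx64 " (%s)\n",
                       pfns[i], va,
                       (va >= VMEMMAP_START && va < VMEMMAP_END) ?
                               "inside vmemmap" : "outside vmemmap");
        }
        return 0;
}

It prints VMEMMAP_START for pfn 0x800_0000, exactly VMEMMAP_END (the
first address outside the window) for pfn 0xc00_0000, and 0x0 (a
wrapped, bogus user address) for pfn 0x1000_0000.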

The bottom line is that the VMEMMAP region is dimensioned to cover
system memory only, i.e., what can be covered by the kernel direct
map. If you want to allocate struct pages for things that are not
system memory, you will need to enlarge the VMEMMAP region, and ensure
that request_mem_region() produces a region that is covered by it.

This is going to be tricky with LPA2, because there, the 4k pages
configuration already uses up half of the vmalloc region to cover the
linear map, so we have to consider this carefully.


> More below:
>
> ...
> >>                 /* arch/x86/mm/init_64.c */
> >>                 vmemmap_free()
> >>                 {
> >>                   VM_BUG_ON(!PAGE_ALIGNED(start));
> >>                   VM_BUG_ON(!PAGE_ALIGNED(end));
> >>                   ...
> >>                 }
> >>
> >> So, the warning is a false positive for this case. Therefore, skip the
> >> warning if CONFIG_DEVICE_PRIVATE is set.
> >>
> >
> > I don't think this is a false positive. We'll need to adjust
> > VMEMMAP_SIZE to account for this.
> >
>
> Looking at the (new to me) arm64 code for this, VMEMMAP_SIZE is only
> ever used to calculate VMEMMAP_END, which in turn is used for the
> WARN_ON()s in question, plus as the "ceiling" arg to free_empty_tables().
>
> It seems Mostly Harmless. How would you feel about changing it to a
> WARN_ON_ONCE() as Andrew suggested? That would completely fix the
> user-visible symptoms:
>

Actually, it should be a hard error, given the above.

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 6f9d8898a025..82d4205af9f2 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1157,7 +1157,7 @@ int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
>  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>                 struct vmem_altmap *altmap)
>  {
> -       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> +       WARN_ON_ONCE((start < VMEMMAP_START) || (end > VMEMMAP_END));
>
>         if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
>                 return vmemmap_populate_basepages(start, end, node, altmap);
> @@ -1169,7 +1169,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  void vmemmap_free(unsigned long start, unsigned long end,
>                 struct vmem_altmap *altmap)
>  {
> -       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
> +       WARN_ON_ONCE((start < VMEMMAP_START) || (end > VMEMMAP_END));
>
>         unmap_hotplug_range(start, end, true, altmap);
>         free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>
John Hubbard April 10, 2023, 7:39 a.m. UTC | #6
On 4/7/23 03:45, Ard Biesheuvel wrote:
...
> That is what I am talking about - the struct pages are allocated in a
> region that is reserved for something else.
> 
> Maybe an example helps here:

It does! After also poking around quite a lot, and comparing to x86,
it is starting to become clearer now.

> 
> When running the 39-bit VA kernel build on an AMD Seattle board, we will
> have (assuming sizeof(struct page) == 64)
> 
> memstart_addr := 0x80_0000_0000
> 
> PAGE_OFFSET := 0xffff_ff80_0000_0000
> 
> VMEMMAP_SHIFT := 6
> VMEMMAP_START := 0xffff_fffe_0000_0000
> VMEMMAP_SIZE := 0x1_0000_0000
> 
> pfn_to_page() conversions are based on ordinary array indexing
> involving vmemmap[], where vmemmap is defined as
> 
> #define vmemmap \
>     ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
> 
> So the PFN associated with the first usable DRAM address is
> 0x800_0000, and pfn_to_page(0x800_0000) will return VMEMMAP_START.

OK, I see how that's set up, yes.

> 
> pfn_to_page(x) for any x < 0x800_0000 will produce a kernel VA that
> points into the vmalloc area, and may conflict with the kernel
> mapping, modules mappings, per-CPU mappings, IO mappings, etc etc.
> 
> pfn_to_page(x) for values 0xc00_0000 < x < 0x1000_0000 will produce a
> kernel VA that points outside the region set aside for the vmemmap.
> This region is currently unused, but that will likely change soon.
> 

I tentatively think I'm in this case right now. Because there is no wrap
around happening in my particular config, which is CONFIG_ARM64_VA_BITS
== 48, and PAGE_SIZE == 4KB and sizeof(struct page) == 64 (details
below).

It occurs to me that ZONE_DEVICE and (within that category) device
private page support need only support rather large setups. On x86, it
requires 64-bit. And on arm64, from what I'm learning after a day or so
of looking around and comparing, I think we must require at least 48 bit
VA support. Otherwise there's just no room for things.

And for smaller systems, everyone disables this fancy automatic handling
(hmm_range_fault()-based page migration) anyway, partly because of the
small VA and PA ranges, but also because of size and performance
constraints.

> pfn_to_page(x) for any x >= 0x1000_0000 will wrap around and produce a
> bogus address in the user range.
> 
> The bottom line is that the VMEMMAP region is dimensioned to cover
> system memory only, i.e., what can be covered by the kernel direct
> map. If you want to allocate struct pages for things that are not
> system memory, you will need to enlarge the VMEMMAP region, and ensure
> that request_mem_region() produces a region that is covered by it.
> 
> This is going to be tricky with LPA2, because there, the 4k pages
> configuration already uses up half of the vmalloc region to cover the
> linear map, so we have to consider this carefully.

Things are interlocked a little differently on arm64 than on x86, and
the layout is also different. One other interesting thing jumps out at
me: On arm64, the (VMALLOC_END - VMALLOC_START) size is *huge*: 123 TB
on my config. And it seems to cover the kernel mapping. On x86, those
are separate. This still confuses me a bit and I wonder if I'm reading
it wrong?

Also, below are the values on my 48 bit VA setup. I'm listing these in
order to help jumpstart thinking about how exactly to extend
VMEMMAP_SIZE. GPUs have on the order of GBs of memory these days, so
that's the order of magnitude that's needed.
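
(For scale, using the 4 KB PAGE_SIZE and 64-byte struct page values
listed below: the memmap overhead is device_memory / 64, i.e. 16 MB of
struct pages per 1 GB of device memory, so even a 128 GB device only
needs about 2 GB of additional vmemmap coverage.)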

PAGE_OFFSET:                      0xffff000000000000
PAGE_END:                         0xffff800000000000
high_memory:                      0xffff087f80000000 (8 TB)

VMALLOC_START:                    0xffff800008000000
VMALLOC_END:                      0xfffffbfff0000000 (123 TB)

vmemmap:                          0xfffffbfffe000000
VMEMMAP_START:                    0xfffffc0000000000
VMEMMAP_END:                      0xfffffe0000000000

Typical device private struct page
that is causing warnings:         0xffffffffaee00000

VMEMMAP_SIZE:                     0x0000020000000000 (2 TB)
VMEMMAP_SHIFT:                    6

PHYS_OFFSET:                      0x0000000080000000
memstart_addr (signed 64-bit):    0x0000000080000000

MODULES_VADDR:                    0xffff800000000000
MODULES_END:                      0xffff800008000000 (128 MB)

PAGE_SHIFT:                       12
PAGE_SIZE:                        0x0000000000001000 (4 KB)
PAGE_MASK:                        0xfffffffffffff000

PMD_SIZE:                         0x0000000000200000 (2 MB)
PMD_MASK:                         0xffffffffffe00000

PUD_SIZE:                         0x0000000040000000 (1 GB)
PUD_MASK:                         0xffffffffc0000000

PGDIR_SIZE:                       0x0000008000000000 (512 GB)

PTE_ADDR_MASK:                    0x0000fffffffff000
sizeof(struct page):              64

thanks,
John Hubbard April 11, 2023, 2:48 a.m. UTC | #7
On 4/10/23 00:39, John Hubbard wrote:
>> pfn_to_page(x) for values 0xc00_0000 < x < 0x1000_0000 will produce a
>> kernel VA that points outside the region set aside for the vmemmap.
>> This region is currently unused, but that will likely change soon.
>>
> 
> I tentatively think I'm in this case right now. Because there is no wrap
> around happening in my particular config, which is CONFIG_ARM64_VA_BITS
> == 48, and PAGE_SIZE == 4KB and sizeof(struct page) == 64 (details
> below).
> 

Correction, actually it *is* wrapping around, and ending up as a bogus
user space address, as you said it would for pfns above the range:

page_to_pfn(0xffffffffaee00000):  0x0000000ffec38000


> It occurs to me that ZONE_DEVICE and (within that category) device
> private page support need only support rather large setups. On x86, it
> requires 64-bit. And on arm64, from what I'm learning after a day or so
> of looking around and comparing, I think we must require at least 48 bit
> VA support. Otherwise there's just no room for things.

I'm still not sure of how to make room, but working on it.


thanks,
Ard Biesheuvel May 12, 2023, 2:42 p.m. UTC | #8
On Tue, 11 Apr 2023 at 04:48, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 4/10/23 00:39, John Hubbard wrote:
> >> pfn_to_page(x) for values 0xc00_0000 < x < 0x1000_0000 will produce a
> >> kernel VA that points outside the region set aside for the vmemmap.
> >> This region is currently unused, but that will likely change soon.
> >>
> >
> > I tentatively think I'm in this case right now. Because there is no wrap
> > around happening in my particular config, which is CONFIG_ARM64_VA_BITS
> > == 48, and PAGE_SIZE == 4KB and sizeof(struct page) == 64 (details
> > below).
> >
>
> Correction, actually it *is* wrapping around, and ending up as a bogus
> user space address, as you said it would when being above the range:
>
> page_to_pfn(0xffffffffaee00000):  0x0000000ffec38000
>

Interesting.

>
> > It occurs to me that ZONE_DEVICE and (within that category) device
> > private page support need only support rather large setups. On x86, it
> > requires 64-bit. And on arm64, from what I'm learning after a day or so
> > of looking around and comparing, I think we must require at least 48 bit
> > VA support. Otherwise there's just no room for things.
>
> I'm still not sure of how to make room, but working on it.
>

The assumption that only the linear map needs to be covered by struct
pages is rather fundamental to the arm64 mm code, as far as I am
aware.

Given that these device memory regions have no direct correspondence
with the linear map at all, couldn't we simply vmalloc() a range of
memory for struct pages for such a region and wire that up in the
existing code? That seems far more maintainable to me than
reorganizing the entire kernel VA space, and only for some choices for
the dimensions.
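
A very rough sketch of that idea (the function names and hook point are
made up here purely for illustration; the real wiring would live in the
sparse-vmemmap / ZONE_DEVICE setup path, and pfn_to_page() /
page_to_pfn() would have to learn about such ranges):

#include <linux/mm_types.h>
#include <linux/vmalloc.h>

/*
 * Back the struct page[] for a device private PFN range with vmalloc'ed
 * memory instead of placing it inside the fixed VMEMMAP window.
 */
static struct page *devmem_alloc_memmap(unsigned long nr_pages)
{
        /* vzalloc(): zeroed, page-aligned memory in the vmalloc area. */
        return vzalloc(nr_pages * sizeof(struct page));
}

static void devmem_free_memmap(struct page *memmap)
{
        vfree(memmap);
}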
John Hubbard May 13, 2023, 2:06 a.m. UTC | #9
On 5/12/23 07:42, Ard Biesheuvel wrote:
>> I'm still not sure of how to make room, but working on it.
>>
> 
> The assumption that only the linear map needs to be covered by struct
> pages is rather fundamental to the arm64 mm code, as far as I am
> aware.
> 
> Given that these device memory regions have no direct correspondence
> with the linear map at all, couldn't we simply vmalloc() a range of
> memory for struct pages for such a region and wire that up in the
> existing code? That seems far more maintainable to me than
> reorganizing the entire kernel VA space, and only for some choices for
> the dimensions.

The vmalloc approach does sound like it should Just Work, yes. I'll try it out.

And now I'm trying to remember why Jerome didn't use that approach for x86
originally. If this fixes HMM on arm64, I'll revisit that question too.

Really appreciate the help and advice here.


thanks,

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6f9d8898a025..d5c9b611a8d1 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1157,8 +1157,10 @@  int __meminit vmemmap_check_pmd(pmd_t *pmdp, int node,
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		struct vmem_altmap *altmap)
 {
+/* Device private pages are outside of the CPU's physical page range. */
+#ifndef CONFIG_DEVICE_PRIVATE
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
-
+#endif
 	if (!IS_ENABLED(CONFIG_ARM64_4K_PAGES))
 		return vmemmap_populate_basepages(start, end, node, altmap);
 	else
@@ -1169,8 +1171,10 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 void vmemmap_free(unsigned long start, unsigned long end,
 		struct vmem_altmap *altmap)
 {
+/* Device private pages are outside of the CPU's physical page range. */
+#ifndef CONFIG_DEVICE_PRIVATE
 	WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
-
+#endif
 	unmap_hotplug_range(start, end, true, altmap);
 	free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
 }