diff mbox series

[6/6] arm64/mm: Enable ZONE_DEVICE

Message ID 1554265806-11501-7-git-send-email-anshuman.khandual@arm.com (mailing list archive)
State New, archived
Headers show
Series arm64/mm: Enable memory hot remove and ZONE_DEVICE | expand

Commit Message

Anshuman Khandual April 3, 2019, 4:30 a.m. UTC
Arch implementation for functions which create or destroy vmemmap mapping
(vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
device memory range through driver provided vmem_altmap structure which
fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
creates vmemmap section mappings and utilize vmem_altmap structure.

Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

Comments

Robin Murphy April 3, 2019, 1:58 p.m. UTC | #1
[ +Dan, Jerome ]

On 03/04/2019 05:30, Anshuman Khandual wrote:
> Arch implementation for functions which create or destroy vmemmap mapping
> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> device memory range through driver provided vmem_altmap structure which
> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just

ZONE_DEVICE is about more than just altmap support, no?

> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
> creates vmemmap section mappings and utilize vmem_altmap structure.

What prevents it from working with other page sizes? One of the foremost 
use-cases for our 52-bit VA/PA support is to enable mapping large 
quantities of persistent memory, so we really do need this for 64K pages 
too. FWIW, it appears not to be an issue for PowerPC.

> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>   arch/arm64/Kconfig | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index db3e625..b5d8cf5 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -31,6 +31,7 @@ config ARM64
>   	select ARCH_HAS_SYSCALL_WRAPPER
>   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES

IIRC certain configurations (HMM?) don't even build if you just turn 
this on alone (although of course things may have changed elsewhere in 
the meantime) - crucially, though, from previous discussions[1] it seems 
fundamentally unsafe, since I don't think we can guarantee that nobody 
will touch the corners of ZONE_DEVICE that also require pte_devmap in 
order not to go subtly wrong. I did get as far as cooking up some 
patches to sort that out [2][3] which I never got round to posting for 
their own sake, so please consider picking those up as part of this series.

Robin.

>   	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   	select ARCH_INLINE_READ_LOCK if !PREEMPT
>   	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> 


[1] 
https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
[2] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
[3] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
(apologies to anyone if the linux-arm.org server is being flaky as usual 
and requires a few tries to respond properly)
Jerome Glisse April 3, 2019, 4:07 p.m. UTC | #2
On Wed, Apr 03, 2019 at 02:58:28PM +0100, Robin Murphy wrote:
> [ +Dan, Jerome ]
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
> > Arch implementation for functions which create or destroy vmemmap mapping
> > (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> > device memory range through driver provided vmem_altmap structure which
> > fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> 
> ZONE_DEVICE is about more than just altmap support, no?
> 
> > enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
> > applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
> > creates vmemmap section mappings and utilize vmem_altmap structure.
> 
> What prevents it from working with other page sizes? One of the foremost
> use-cases for our 52-bit VA/PA support is to enable mapping large quantities
> of persistent memory, so we really do need this for 64K pages too. FWIW, it
> appears not to be an issue for PowerPC.
> 
> > Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> > ---
> >   arch/arm64/Kconfig | 1 +
> >   1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index db3e625..b5d8cf5 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -31,6 +31,7 @@ config ARM64
> >   	select ARCH_HAS_SYSCALL_WRAPPER
> >   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
> >   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> > +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> 
> IIRC certain configurations (HMM?) don't even build if you just turn this on
> alone (although of course things may have changed elsewhere in the meantime)
> - crucially, though, from previous discussions[1] it seems fundamentally
> unsafe, since I don't think we can guarantee that nobody will touch the
> corners of ZONE_DEVICE that also require pte_devmap in order not to go
> subtly wrong. I did get as far as cooking up some patches to sort that out
> [2][3] which I never got round to posting for their own sake, so please
> consider picking those up as part of this series.

Correct _do not_ enable ZONE_DEVICE without support for pte_devmap detection.
If you want some feature of ZONE_DEVICE. Like HMM as while DAX does require
pte_devmap, HMM device private does not. So you would first have to split
ZONE_DEVICE into more sub-features kconfig option.

What is the end use case you are looking for ? Persistent memory ?

Cheers,
Jérôme
Anshuman Khandual April 4, 2019, 4:42 a.m. UTC | #3
On 04/03/2019 07:28 PM, Robin Murphy wrote:
> [ +Dan, Jerome ]
> 
> On 03/04/2019 05:30, Anshuman Khandual wrote:
>> Arch implementation for functions which create or destroy vmemmap mapping
>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>> device memory range through driver provided vmem_altmap structure which
>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> 
> ZONE_DEVICE is about more than just altmap support, no?

Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
struct pages for it has stand alone and self contained use case. The driver could
just want to manage the memory itself but with struct pages either in the RAM or
in the device memory range through struct vmem_altmap. The driver may not choose
to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
map these pages into any user pagetable which would necessitate support for
pte|pmd|pud_devmap.

Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
IMHO ZONE_DEVICE is self contained and can be evaluated in itself.

> 
>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
>> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
>> creates vmemmap section mappings and utilize vmem_altmap structure.
> 
> What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.


On !AR64_4K_PAGES vmemmap_populate() calls vmemmap_populate_basepages() which
does not support struct vmem_altmap right now. Originally was planning to send
the vmemmap_populate_basepages() enablement patches separately but will post it
here for review.

> 
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>>   arch/arm64/Kconfig | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index db3e625..b5d8cf5 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -31,6 +31,7 @@ config ARM64
>>       select ARCH_HAS_SYSCALL_WRAPPER
>>       select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>       select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> 
> IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.

In the previous discussion mentioned here [1] it sort of indicates that we
cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP). I
dont understand why ! The driver can just hotplug the range into ZONE_DEVICE,
manage the memory itself without mapping them to user page table ever. IIUC
ZONE_DEVICE must not need user mapped device PFN support. All the corner case
problems discussed previously come in once these new 'device PFN' memory which
is now in ZONE_DEVICE get mapped into user page table.

> 
> Robin.
> 
>>       select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>       select ARCH_INLINE_READ_LOCK if !PREEMPT
>>       select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
>>
> 
> 
> [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
> [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
> [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
> (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)

I have not evaluated pte_devmap(). Will consider [3] when enabling it. But
I still dont understand why ZONE_DEVICE can not be enabled and used from a
driver which never requires user mapping or pte|pmd|pud_devmap() support.
Anshuman Khandual April 4, 2019, 5:03 a.m. UTC | #4
On 04/03/2019 09:37 PM, Jerome Glisse wrote:
> On Wed, Apr 03, 2019 at 02:58:28PM +0100, Robin Murphy wrote:
>> [ +Dan, Jerome ]
>>
>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>> Arch implementation for functions which create or destroy vmemmap mapping
>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>>> device memory range through driver provided vmem_altmap structure which
>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
>>
>> ZONE_DEVICE is about more than just altmap support, no?
>>
>>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
>>> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
>>> creates vmemmap section mappings and utilize vmem_altmap structure.
>>
>> What prevents it from working with other page sizes? One of the foremost
>> use-cases for our 52-bit VA/PA support is to enable mapping large quantities
>> of persistent memory, so we really do need this for 64K pages too. FWIW, it
>> appears not to be an issue for PowerPC.
>>
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>> ---
>>>   arch/arm64/Kconfig | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index db3e625..b5d8cf5 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -31,6 +31,7 @@ config ARM64
>>>   	select ARCH_HAS_SYSCALL_WRAPPER
>>>   	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>>   	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>>> +	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
>>
>> IIRC certain configurations (HMM?) don't even build if you just turn this on
>> alone (although of course things may have changed elsewhere in the meantime)
>> - crucially, though, from previous discussions[1] it seems fundamentally
>> unsafe, since I don't think we can guarantee that nobody will touch the
>> corners of ZONE_DEVICE that also require pte_devmap in order not to go
>> subtly wrong. I did get as far as cooking up some patches to sort that out
>> [2][3] which I never got round to posting for their own sake, so please
>> consider picking those up as part of this series.
> 
> Correct _do not_ enable ZONE_DEVICE without support for pte_devmap detection.

Driver managed ZONE_DEVICE memory which never maps into user page table is not
a valid use case for ZONE_DEVICE ? Also what about MEMORY_DEVICE_PRIVATE ? That
can never be mapped into user page table. A driver can still manage these non
coherent memory through it's struct pages (which will be allocated inside RAM)

> If you want some feature of ZONE_DEVICE. Like HMM as while DAX does require
> pte_devmap, HMM device private does not. So you would first have to split
> ZONE_DEVICE into more sub-features kconfig option.

CONFIG_ZONE_DEVICE does not do that already ! All it says is that a device
memory range can be plugged into ZONE_DEVICE either as PRIVATE (non-coherent)
or PUBLIC/PCI_P2PDMA (coherent) memory without mandating anything about how
these memory will be further used.

> 
> What is the end use case you are looking for ? Persistent memory ?

Persistent memory is one of the primary use cases.
Dan Williams April 4, 2019, 5:04 a.m. UTC | #5
On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
<anshuman.khandual@arm.com> wrote:
>
>
>
> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> > [ +Dan, Jerome ]
> >
> > On 03/04/2019 05:30, Anshuman Khandual wrote:
> >> Arch implementation for functions which create or destroy vmemmap mapping
> >> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> >> device memory range through driver provided vmem_altmap structure which
> >> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> >
> > ZONE_DEVICE is about more than just altmap support, no?
>
> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
> struct pages for it has stand alone and self contained use case. The driver could
> just want to manage the memory itself but with struct pages either in the RAM or
> in the device memory range through struct vmem_altmap. The driver may not choose
> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
> map these pages into any user pagetable which would necessitate support for
> pte|pmd|pud_devmap.

What's left for ZONE_DEVICE if none of the above cases are used?

> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.

I'm not convinced. What's the specific use case.

>
> >
> >> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
> >> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
> >> creates vmemmap section mappings and utilize vmem_altmap structure.
> >
> > What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.
>
>
> On !AR64_4K_PAGES vmemmap_populate() calls vmemmap_populate_basepages() which
> does not support struct vmem_altmap right now. Originally was planning to send
> the vmemmap_populate_basepages() enablement patches separately but will post it
> here for review.
>
> >
> >> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> >> ---
> >>   arch/arm64/Kconfig | 1 +
> >>   1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> index db3e625..b5d8cf5 100644
> >> --- a/arch/arm64/Kconfig
> >> +++ b/arch/arm64/Kconfig
> >> @@ -31,6 +31,7 @@ config ARM64
> >>       select ARCH_HAS_SYSCALL_WRAPPER
> >>       select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
> >>       select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> >> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
> >
> > IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.
>
> In the previous discussion mentioned here [1] it sort of indicates that we
> cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP). I
> dont understand why !

Because ZONE_DEVICE was specifically invented to solve get_user_pages() for DAX.

> The driver can just hotplug the range into ZONE_DEVICE,
> manage the memory itself without mapping them to user page table ever.

Then why do you even need 'struct page' objects?

> IIUC
> ZONE_DEVICE must not need user mapped device PFN support.

No, you don't understand correctly, or I don't understand how you plan
to use ZONE_DEVICE outside it's intended use case.

> All the corner case
> problems discussed previously come in once these new 'device PFN' memory which
> is now in ZONE_DEVICE get mapped into user page table.
>
> >
> > Robin.
> >
> >>       select ARCH_HAVE_NMI_SAFE_CMPXCHG
> >>       select ARCH_INLINE_READ_LOCK if !PREEMPT
> >>       select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
> >>
> >
> >
> > [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
> > [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
> > [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
> > (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)
>
> I have not evaluated pte_devmap(). Will consider [3] when enabling it. But
> I still dont understand why ZONE_DEVICE can not be enabled and used from a
> driver which never requires user mapping or pte|pmd|pud_devmap() support.

Because there are mm paths that make assumptions about ZONE_DEVICE
that your use case might violate.
Robin Murphy April 4, 2019, 9:46 a.m. UTC | #6
On 04/04/2019 06:04, Dan Williams wrote:
> On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 04/03/2019 07:28 PM, Robin Murphy wrote:
>>> [ +Dan, Jerome ]
>>>
>>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>>> Arch implementation for functions which create or destroy vmemmap mapping
>>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>>>> device memory range through driver provided vmem_altmap structure which
>>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
>>>
>>> ZONE_DEVICE is about more than just altmap support, no?
>>
>> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
>> struct pages for it has stand alone and self contained use case. The driver could
>> just want to manage the memory itself but with struct pages either in the RAM or
>> in the device memory range through struct vmem_altmap. The driver may not choose
>> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
>> map these pages into any user pagetable which would necessitate support for
>> pte|pmd|pud_devmap.
> 
> What's left for ZONE_DEVICE if none of the above cases are used?
> 
>> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
>> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> 
> I'm not convinced. What's the specific use case.

The fundamental "roadmap" reason we've been doing this is to enable 
further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact 
that ZONE_DEVICE immediately opens the door to the various other stuff 
that the CCIX folks have interest in is a definite bonus, so it would 
certainly be preferable to get arm64 on par with the current state of 
things rather than try to subdivide the scope further.

I started working on this from the ZONE_DEVICE end, but got bogged down 
in trying to replace my copied-from-s390 dummy hot-remove implementation 
with something proper. Anshuman has stepped in to help with hot-remove 
(since we also have cloud folks wanting that for its own sake), so is 
effectively coming at the problem from the opposite direction, and I'll 
be the first to admit that we've not managed the greatest job of meeting 
in the middle and coordinating our upstream story; sorry about that :)

Let me freshen up my devmap patches and post them properly, since that 
discussion doesn't have to happen in the context of hot-remove; they're 
effectively just parallel dependencies for ZONE_DEVICE.

Robin.

>>
>>>
>>>> enable ZONE_DEVICE by subscribing to ARCH_HAS_ZONE_DEVICE. But this is only
>>>> applicable for ARM64_4K_PAGES (ARM64_SWAPPER_USES_SECTION_MAPS) only which
>>>> creates vmemmap section mappings and utilize vmem_altmap structure.
>>>
>>> What prevents it from working with other page sizes? One of the foremost use-cases for our 52-bit VA/PA support is to enable mapping large quantities of persistent memory, so we really do need this for 64K pages too. FWIW, it appears not to be an issue for PowerPC.
>>
>>
>> On !AR64_4K_PAGES vmemmap_populate() calls vmemmap_populate_basepages() which
>> does not support struct vmem_altmap right now. Originally was planning to send
>> the vmemmap_populate_basepages() enablement patches separately but will post it
>> here for review.
>>
>>>
>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> ---
>>>>    arch/arm64/Kconfig | 1 +
>>>>    1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index db3e625..b5d8cf5 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -31,6 +31,7 @@ config ARM64
>>>>        select ARCH_HAS_SYSCALL_WRAPPER
>>>>        select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
>>>>        select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>>>> +    select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
>>>
>>> IIRC certain configurations (HMM?) don't even build if you just turn this on alone (although of course things may have changed elsewhere in the meantime) - crucially, though, from previous discussions[1] it seems fundamentally unsafe, since I don't think we can guarantee that nobody will touch the corners of ZONE_DEVICE that also require pte_devmap in order not to go subtly wrong. I did get as far as cooking up some patches to sort that out [2][3] which I never got round to posting for their own sake, so please consider picking those up as part of this series.
>>
>> In the previous discussion mentioned here [1] it sort of indicates that we
>> cannot have a viable (ARCH_HAS_ZONE_DEVICE=y but !__HAVE_ARCH_PTE_DEVMAP). I
>> dont understand why !
> 
> Because ZONE_DEVICE was specifically invented to solve get_user_pages() for DAX.
> 
>> The driver can just hotplug the range into ZONE_DEVICE,
>> manage the memory itself without mapping them to user page table ever.
> 
> Then why do you even need 'struct page' objects?
> 
>> IIUC
>> ZONE_DEVICE must not need user mapped device PFN support.
> 
> No, you don't understand correctly, or I don't understand how you plan
> to use ZONE_DEVICE outside it's intended use case.
> 
>> All the corner case
>> problems discussed previously come in once these new 'device PFN' memory which
>> is now in ZONE_DEVICE get mapped into user page table.
>>
>>>
>>> Robin.
>>>
>>>>        select ARCH_HAVE_NMI_SAFE_CMPXCHG
>>>>        select ARCH_INLINE_READ_LOCK if !PREEMPT
>>>>        select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
>>>>
>>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAA9_cmfA9GS+1M1aSyv1ty5jKY3iho3CERhnRAruWJW3PfmpgA@mail.gmail.com/#t
>>> [2] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=61816b833afdb56b49c2e58f5289ae18809e5d67
>>> [3] http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a5a16560eb1becf9a1d4cc0d03d6b5e76da4f4e1
>>> (apologies to anyone if the linux-arm.org server is being flaky as usual and requires a few tries to respond properly)
>>
>> I have not evaluated pte_devmap(). Will consider [3] when enabling it. But
>> I still dont understand why ZONE_DEVICE can not be enabled and used from a
>> driver which never requires user mapping or pte|pmd|pud_devmap() support.
> 
> Because there are mm paths that make assumptions about ZONE_DEVICE
> that your use case might violate.
>
Dan Williams April 7, 2019, 10:11 p.m. UTC | #7
On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 04/04/2019 06:04, Dan Williams wrote:
> > On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> > <anshuman.khandual@arm.com> wrote:
> >>
> >>
> >>
> >> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> >>> [ +Dan, Jerome ]
> >>>
> >>> On 03/04/2019 05:30, Anshuman Khandual wrote:
> >>>> Arch implementation for functions which create or destroy vmemmap mapping
> >>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> >>>> device memory range through driver provided vmem_altmap structure which
> >>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> >>>
> >>> ZONE_DEVICE is about more than just altmap support, no?
> >>
> >> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
> >> struct pages for it has stand alone and self contained use case. The driver could
> >> just want to manage the memory itself but with struct pages either in the RAM or
> >> in the device memory range through struct vmem_altmap. The driver may not choose
> >> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
> >> map these pages into any user pagetable which would necessitate support for
> >> pte|pmd|pud_devmap.
> >
> > What's left for ZONE_DEVICE if none of the above cases are used?
> >
> >> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
> >> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> >
> > I'm not convinced. What's the specific use case.
>
> The fundamental "roadmap" reason we've been doing this is to enable
> further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
> that ZONE_DEVICE immediately opens the door to the various other stuff
> that the CCIX folks have interest in is a definite bonus, so it would
> certainly be preferable to get arm64 on par with the current state of
> things rather than try to subdivide the scope further.
>
> I started working on this from the ZONE_DEVICE end, but got bogged down
> in trying to replace my copied-from-s390 dummy hot-remove implementation
> with something proper. Anshuman has stepped in to help with hot-remove
> (since we also have cloud folks wanting that for its own sake), so is
> effectively coming at the problem from the opposite direction, and I'll
> be the first to admit that we've not managed the greatest job of meeting
> in the middle and coordinating our upstream story; sorry about that :)
>
> Let me freshen up my devmap patches and post them properly, since that
> discussion doesn't have to happen in the context of hot-remove; they're
> effectively just parallel dependencies for ZONE_DEVICE.

Sounds good. It's also worth noting that Ira's recent patches for
supporting get_user_pages_fast() for "longterm" pins relies on
PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
it needs to fall back to slow-GUP. So it really is the case that
"devmap" support is an assumption for ZONE_DEVICE.
Ira Weiny April 8, 2019, 4:03 a.m. UTC | #8
On Sun, Apr 07, 2019 at 03:11:00PM -0700, Dan Williams wrote:
> On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 04/04/2019 06:04, Dan Williams wrote:
> > > On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >>
> > >>
> > >> On 04/03/2019 07:28 PM, Robin Murphy wrote:
> > >>> [ +Dan, Jerome ]
> > >>>
> > >>> On 03/04/2019 05:30, Anshuman Khandual wrote:
> > >>>> Arch implementation for functions which create or destroy vmemmap mapping
> > >>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
> > >>>> device memory range through driver provided vmem_altmap structure which
> > >>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
> > >>>
> > >>> ZONE_DEVICE is about more than just altmap support, no?
> > >>
> > >> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
> > >> struct pages for it has stand alone and self contained use case. The driver could
> > >> just want to manage the memory itself but with struct pages either in the RAM or
> > >> in the device memory range through struct vmem_altmap. The driver may not choose
> > >> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
> > >> map these pages into any user pagetable which would necessitate support for
> > >> pte|pmd|pud_devmap.
> > >
> > > What's left for ZONE_DEVICE if none of the above cases are used?
> > >
> > >> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
> > >> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
> > >
> > > I'm not convinced. What's the specific use case.
> >
> > The fundamental "roadmap" reason we've been doing this is to enable
> > further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
> > that ZONE_DEVICE immediately opens the door to the various other stuff
> > that the CCIX folks have interest in is a definite bonus, so it would
> > certainly be preferable to get arm64 on par with the current state of
> > things rather than try to subdivide the scope further.
> >
> > I started working on this from the ZONE_DEVICE end, but got bogged down
> > in trying to replace my copied-from-s390 dummy hot-remove implementation
> > with something proper. Anshuman has stepped in to help with hot-remove
> > (since we also have cloud folks wanting that for its own sake), so is
> > effectively coming at the problem from the opposite direction, and I'll
> > be the first to admit that we've not managed the greatest job of meeting
> > in the middle and coordinating our upstream story; sorry about that :)
> >
> > Let me freshen up my devmap patches and post them properly, since that
> > discussion doesn't have to happen in the context of hot-remove; they're
> > effectively just parallel dependencies for ZONE_DEVICE.
> 
> Sounds good. It's also worth noting that Ira's recent patches for
> supporting get_user_pages_fast() for "longterm" pins relies on
> PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
> it needs to fall back to slow-GUP. So it really is the case that
> "devmap" support is an assumption for ZONE_DEVICE.

Could you cc me on the patches when you post?

Thanks,
Ira
Anshuman Khandual April 8, 2019, 6:03 a.m. UTC | #9
On 04/08/2019 09:33 AM, Ira Weiny wrote:
> On Sun, Apr 07, 2019 at 03:11:00PM -0700, Dan Williams wrote:
>> On Thu, Apr 4, 2019 at 2:47 AM Robin Murphy <robin.murphy@arm.com> wrote:
>>>
>>> On 04/04/2019 06:04, Dan Williams wrote:
>>>> On Wed, Apr 3, 2019 at 9:42 PM Anshuman Khandual
>>>> <anshuman.khandual@arm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 04/03/2019 07:28 PM, Robin Murphy wrote:
>>>>>> [ +Dan, Jerome ]
>>>>>>
>>>>>> On 03/04/2019 05:30, Anshuman Khandual wrote:
>>>>>>> Arch implementation for functions which create or destroy vmemmap mapping
>>>>>>> (vmemmap_populate, vmemmap_free) can comprehend and allocate from inside
>>>>>>> device memory range through driver provided vmem_altmap structure which
>>>>>>> fulfils all requirements to enable ZONE_DEVICE on the platform. Hence just
>>>>>>
>>>>>> ZONE_DEVICE is about more than just altmap support, no?
>>>>>
>>>>> Hot plugging the memory into a dev->numa_node's ZONE_DEVICE and initializing the
>>>>> struct pages for it has stand alone and self contained use case. The driver could
>>>>> just want to manage the memory itself but with struct pages either in the RAM or
>>>>> in the device memory range through struct vmem_altmap. The driver may not choose
>>>>> to opt for HMM, FS DAX, P2PDMA (use cases of ZONE_DEVICE) where it may have to
>>>>> map these pages into any user pagetable which would necessitate support for
>>>>> pte|pmd|pud_devmap.
>>>>
>>>> What's left for ZONE_DEVICE if none of the above cases are used?
>>>>
>>>>> Though I am still working towards getting HMM, FS DAX, P2PDMA enabled on arm64,
>>>>> IMHO ZONE_DEVICE is self contained and can be evaluated in itself.
>>>>
>>>> I'm not convinced. What's the specific use case.
>>>
>>> The fundamental "roadmap" reason we've been doing this is to enable
>>> further NVDIMM/pmem development (libpmem/Qemu/etc.) on arm64. The fact
>>> that ZONE_DEVICE immediately opens the door to the various other stuff
>>> that the CCIX folks have interest in is a definite bonus, so it would
>>> certainly be preferable to get arm64 on par with the current state of
>>> things rather than try to subdivide the scope further.
>>>
>>> I started working on this from the ZONE_DEVICE end, but got bogged down
>>> in trying to replace my copied-from-s390 dummy hot-remove implementation
>>> with something proper. Anshuman has stepped in to help with hot-remove
>>> (since we also have cloud folks wanting that for its own sake), so is
>>> effectively coming at the problem from the opposite direction, and I'll
>>> be the first to admit that we've not managed the greatest job of meeting
>>> in the middle and coordinating our upstream story; sorry about that :)
>>>
>>> Let me freshen up my devmap patches and post them properly, since that
>>> discussion doesn't have to happen in the context of hot-remove; they're
>>> effectively just parallel dependencies for ZONE_DEVICE.
>>
>> Sounds good. It's also worth noting that Ira's recent patches for
>> supporting get_user_pages_fast() for "longterm" pins relies on
>> PTE_DEVMAP to determine when fast-GUP is safe to proceed, or whether
>> it needs to fall back to slow-GUP. So it really is the case that
>> "devmap" support is an assumption for ZONE_DEVICE.
> 
> Could you cc me on the patches when you post?

Sure will do.
diff mbox series

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index db3e625..b5d8cf5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -31,6 +31,7 @@  config ARM64
 	select ARCH_HAS_SYSCALL_WRAPPER
 	select ARCH_HAS_TEARDOWN_DMA_OPS if IOMMU_SUPPORT
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAS_ZONE_DEVICE if ARM64_4K_PAGES
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT