[RFC,1/7] x86, mm: ZONE_DEVICE for "device memory"

Message ID	20150813035005.36913.77364.stgit@otcpl-skl-sds-2.jf.intel.com (mailing list archive)
State	RFC
Delegated to:	Dan Williams
Headers	show Return-Path: <linux-nvdimm-bounces@lists.01.org> Subject: [RFC PATCH 1/7] x86, mm: ZONE_DEVICE for "device memory" From: Dan Williams <dan.j.williams@intel.com> To: linux-kernel@vger.kernel.org Date: Wed, 12 Aug 2015 23:50:05 -0400 Message-ID: <20150813035005.36913.77364.stgit@otcpl-skl-sds-2.jf.intel.com> In-Reply-To: <20150813031253.36913.29580.stgit@otcpl-skl-sds-2.jf.intel.com> References: <20150813031253.36913.29580.stgit@otcpl-skl-sds-2.jf.intel.com> User-Agent: StGit/0.17.1-8-g92dd MIME-Version: 1.0 Cc: riel@redhat.com, linux-nvdimm@lists.01.org, Dave Hansen <dave.hansen@linux.intel.com>, david@fromorbit.com, hch@lst.de, linux-mm@kvack.org, Ingo Molnar <mingo@redhat.com>, mgorman@suse.de, "H. Peter Anvin" <hpa@zytor.com>, torvalds@linux-foundation.org, mingo@kernel.org Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linux-nvdimm-bounces@lists.01.org Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Dan Williams Aug. 13, 2015, 3:50 a.m. UTC

While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory.  Device memory has
different lifetime and performance characteristics than RAM.  However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

arch_add_memory() is reorganized a bit in preparation for a new
arch_add_dev_memory() api, for now there is no functional change to the
memory hotplug code.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/Kconfig       |   13 +++++++++++++
 arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
 include/linux/mmzone.h |   23 +++++++++++++++++++++++
 mm/memory_hotplug.c    |    5 ++++-
 mm/page_alloc.c        |    3 +++
 5 files changed, 64 insertions(+), 12 deletions(-)

Jerome Glisse Aug. 14, 2015, 9:37 p.m. UTC | #1

On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> While pmem is usable as a block device or via DAX mappings to userspace
> there are several usage scenarios that can not target pmem due to its
> lack of struct page coverage. In preparation for "hot plugging" pmem
> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
> separately from the ones that are subject to standard page allocations.
> Importantly "device memory" can be removed at will by userspace
> unbinding the driver of the device.
> 
> Having a separate zone prevents allocation and otherwise marks these
> pages that are distinct from typical uniform memory.  Device memory has
> different lifetime and performance characteristics than RAM.  However,
> since we have run out of ZONES_SHIFT bits this functionality currently
> depends on sacrificing ZONE_DMA.
> 
> arch_add_memory() is reorganized a bit in preparation for a new
> arch_add_dev_memory() api, for now there is no functional change to the
> memory hotplug code.
> 
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: linux-mm@kvack.org
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/Kconfig       |   13 +++++++++++++
>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
>  mm/memory_hotplug.c    |    5 ++++-
>  mm/page_alloc.c        |    3 +++
>  5 files changed, 64 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b3a1a5d77d92..64829b17980b 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -308,6 +308,19 @@ config ZONE_DMA
>  
>  	  If unsure, say Y.
>  
> +config ZONE_DEVICE
> +	bool "Device memory (pmem, etc...) hotplug support" if EXPERT
> +	default !ZONE_DMA
> +	depends on !ZONE_DMA
> +	help
> +	  Device memory hotplug support allows for establishing pmem,
> +	  or other device driver discovered memory regions, in the
> +	  memmap. This allows pfn_to_page() lookups of otherwise
> +	  "device-physical" addresses which is needed for using a DAX
> +	  mapping in an O_DIRECT operation, among other things.
> +
> +	  If FS_DAX is enabled, then say Y.
> +
>  config SMP
>  	bool "Symmetric multi-processing support"
>  	---help---
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 3fba623e3ba5..94f0fa56f0ed 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -683,15 +683,8 @@ static void  update_end_of_memory_vars(u64 start, u64 size)
>  	}
>  }
>  
> -/*
> - * Memory is added always to NORMAL zone. This means you will never get
> - * additional DMA/DMA32 memory.
> - */
> -int arch_add_memory(int nid, u64 start, u64 size)
> +static int __arch_add_memory(int nid, u64 start, u64 size, struct zone *zone)
>  {
> -	struct pglist_data *pgdat = NODE_DATA(nid);
> -	struct zone *zone = pgdat->node_zones +
> -		zone_for_memory(nid, start, size, ZONE_NORMAL);
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
>  	int ret;
> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  	ret = __add_pages(nid, zone, start_pfn, nr_pages);
>  	WARN_ON_ONCE(ret);
>  
> -	/* update max_pfn, max_low_pfn and high_memory */
> -	update_end_of_memory_vars(start, size);
> +	/*
> +	 * Update max_pfn, max_low_pfn and high_memory, unless we added
> +	 * "device memory" which should not effect max_pfn
> +	 */
> +	if (!is_dev_zone(zone))
> +		update_end_of_memory_vars(start, size);

What is the rational for not updating max_pfn, max_low_pfn, ... ?

Cheers,
Jérôme

Dan Williams Aug. 14, 2015, 9:52 p.m. UTC | #2

On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
>> While pmem is usable as a block device or via DAX mappings to userspace
>> there are several usage scenarios that can not target pmem due to its
>> lack of struct page coverage. In preparation for "hot plugging" pmem
>> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
>> separately from the ones that are subject to standard page allocations.
>> Importantly "device memory" can be removed at will by userspace
>> unbinding the driver of the device.
>>
>> Having a separate zone prevents allocation and otherwise marks these
>> pages that are distinct from typical uniform memory.  Device memory has
>> different lifetime and performance characteristics than RAM.  However,
>> since we have run out of ZONES_SHIFT bits this functionality currently
>> depends on sacrificing ZONE_DMA.
>>
>> arch_add_memory() is reorganized a bit in preparation for a new
>> arch_add_dev_memory() api, for now there is no functional change to the
>> memory hotplug code.
>>
>> Cc: H. Peter Anvin <hpa@zytor.com>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: linux-mm@kvack.org
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  arch/x86/Kconfig       |   13 +++++++++++++
>>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
>>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
>>  mm/memory_hotplug.c    |    5 ++++-
>>  mm/page_alloc.c        |    3 +++
>>  5 files changed, 64 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index b3a1a5d77d92..64829b17980b 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -308,6 +308,19 @@ config ZONE_DMA
>>
>>         If unsure, say Y.
>>
>> +config ZONE_DEVICE
>> +     bool "Device memory (pmem, etc...) hotplug support" if EXPERT
>> +     default !ZONE_DMA
>> +     depends on !ZONE_DMA
>> +     help
>> +       Device memory hotplug support allows for establishing pmem,
>> +       or other device driver discovered memory regions, in the
>> +       memmap. This allows pfn_to_page() lookups of otherwise
>> +       "device-physical" addresses which is needed for using a DAX
>> +       mapping in an O_DIRECT operation, among other things.
>> +
>> +       If FS_DAX is enabled, then say Y.
>> +
>>  config SMP
>>       bool "Symmetric multi-processing support"
>>       ---help---
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 3fba623e3ba5..94f0fa56f0ed 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
[..]
>> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>       ret = __add_pages(nid, zone, start_pfn, nr_pages);
>>       WARN_ON_ONCE(ret);
>>
>> -     /* update max_pfn, max_low_pfn and high_memory */
>> -     update_end_of_memory_vars(start, size);
>> +     /*
>> +      * Update max_pfn, max_low_pfn and high_memory, unless we added
>> +      * "device memory" which should not effect max_pfn
>> +      */
>> +     if (!is_dev_zone(zone))
>> +             update_end_of_memory_vars(start, size);
>
> What is the rational for not updating max_pfn, max_low_pfn, ... ?
>

The idea is that this memory is not meant to be available to the page
allocator and should not count as new memory capacity.  We're only
hotplugging it to get struct page coverage.

Jerome Glisse Aug. 14, 2015, 10:06 p.m. UTC | #3

On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> >> While pmem is usable as a block device or via DAX mappings to userspace
> >> there are several usage scenarios that can not target pmem due to its
> >> lack of struct page coverage. In preparation for "hot plugging" pmem
> >> into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
> >> separately from the ones that are subject to standard page allocations.
> >> Importantly "device memory" can be removed at will by userspace
> >> unbinding the driver of the device.
> >>
> >> Having a separate zone prevents allocation and otherwise marks these
> >> pages that are distinct from typical uniform memory.  Device memory has
> >> different lifetime and performance characteristics than RAM.  However,
> >> since we have run out of ZONES_SHIFT bits this functionality currently
> >> depends on sacrificing ZONE_DMA.
> >>
> >> arch_add_memory() is reorganized a bit in preparation for a new
> >> arch_add_dev_memory() api, for now there is no functional change to the
> >> memory hotplug code.
> >>
> >> Cc: H. Peter Anvin <hpa@zytor.com>
> >> Cc: Ingo Molnar <mingo@redhat.com>
> >> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> >> Cc: Rik van Riel <riel@redhat.com>
> >> Cc: Mel Gorman <mgorman@suse.de>
> >> Cc: linux-mm@kvack.org
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  arch/x86/Kconfig       |   13 +++++++++++++
> >>  arch/x86/mm/init_64.c  |   32 +++++++++++++++++++++-----------
> >>  include/linux/mmzone.h |   23 +++++++++++++++++++++++
> >>  mm/memory_hotplug.c    |    5 ++++-
> >>  mm/page_alloc.c        |    3 +++
> >>  5 files changed, 64 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index b3a1a5d77d92..64829b17980b 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -308,6 +308,19 @@ config ZONE_DMA
> >>
> >>         If unsure, say Y.
> >>
> >> +config ZONE_DEVICE
> >> +     bool "Device memory (pmem, etc...) hotplug support" if EXPERT
> >> +     default !ZONE_DMA
> >> +     depends on !ZONE_DMA
> >> +     help
> >> +       Device memory hotplug support allows for establishing pmem,
> >> +       or other device driver discovered memory regions, in the
> >> +       memmap. This allows pfn_to_page() lookups of otherwise
> >> +       "device-physical" addresses which is needed for using a DAX
> >> +       mapping in an O_DIRECT operation, among other things.
> >> +
> >> +       If FS_DAX is enabled, then say Y.
> >> +
> >>  config SMP
> >>       bool "Symmetric multi-processing support"
> >>       ---help---
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 3fba623e3ba5..94f0fa56f0ed 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> [..]
> >> @@ -701,11 +694,28 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>       ret = __add_pages(nid, zone, start_pfn, nr_pages);
> >>       WARN_ON_ONCE(ret);
> >>
> >> -     /* update max_pfn, max_low_pfn and high_memory */
> >> -     update_end_of_memory_vars(start, size);
> >> +     /*
> >> +      * Update max_pfn, max_low_pfn and high_memory, unless we added
> >> +      * "device memory" which should not effect max_pfn
> >> +      */
> >> +     if (!is_dev_zone(zone))
> >> +             update_end_of_memory_vars(start, size);
> >
> > What is the rational for not updating max_pfn, max_low_pfn, ... ?
> >
> 
> The idea is that this memory is not meant to be available to the page
> allocator and should not count as new memory capacity.  We're only
> hotplugging it to get struct page coverage.

But this sounds bogus to me to rely on max_pfn to stay smaller than
first_dev_pfn.  For instance you might plug a device that register
dev memory and then some regular memory might be hotplug, effectively
updating max_pfn to a value bigger than first_dev_pfn.

Also i do not think that the buddy allocator use max_pfn or max_low_pfn
to consider page/zone for allocation or not.

Cheers,
Jérôme

Dan Williams Aug. 14, 2015, 10:33 p.m. UTC | #4

On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
[..]
>> > What is the rational for not updating max_pfn, max_low_pfn, ... ?
>> >
>>
>> The idea is that this memory is not meant to be available to the page
>> allocator and should not count as new memory capacity.  We're only
>> hotplugging it to get struct page coverage.
>
> But this sounds bogus to me to rely on max_pfn to stay smaller than
> first_dev_pfn.  For instance you might plug a device that register
> dev memory and then some regular memory might be hotplug, effectively
> updating max_pfn to a value bigger than first_dev_pfn.
>

True.

> Also i do not think that the buddy allocator use max_pfn or max_low_pfn
> to consider page/zone for allocation or not.

Yes, I took it out with no effects.  I'll investigate further whether
we should be touching those variables or not for this new usage.

Dan Williams Aug. 15, 2015, 2:11 a.m. UTC | #5

On Fri, Aug 14, 2015 at 3:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> [..]
>>> > What is the rational for not updating max_pfn, max_low_pfn, ... ?
>>> >
>>>
>>> The idea is that this memory is not meant to be available to the page
>>> allocator and should not count as new memory capacity.  We're only
>>> hotplugging it to get struct page coverage.
>>
>> But this sounds bogus to me to rely on max_pfn to stay smaller than
>> first_dev_pfn.  For instance you might plug a device that register
>> dev memory and then some regular memory might be hotplug, effectively
>> updating max_pfn to a value bigger than first_dev_pfn.
>>
>
> True.
>
>> Also i do not think that the buddy allocator use max_pfn or max_low_pfn
>> to consider page/zone for allocation or not.
>
> Yes, I took it out with no effects.  I'll investigate further whether
> we should be touching those variables or not for this new usage.

Although it does not offer perfect protection if device memory is at a
physically lower address than RAM, skipping the update of these
variables does seem to be what we want.  For example /dev/mem would
fail to allow write access to persistent memory if it fails a
valid_phys_addr_range() check.  Since /dev/mem does not know how to
write to PMEM in a reliably persistent way, it should not treat a
PMEM-pfn like RAM.

Christoph Hellwig Aug. 15, 2015, 8:59 a.m. UTC | #6

On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> The idea is that this memory is not meant to be available to the page
> allocator and should not count as new memory capacity.  We're only
> hotplugging it to get struct page coverage.

This might need a bigger audit of the max_pfn usages.  I remember
architectures using it as a decisions for using IOMMUs or similar.

Christoph Hellwig Aug. 15, 2015, 1:33 p.m. UTC | #7

On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> arch_add_memory() is reorganized a bit in preparation for a new
> arch_add_dev_memory() api, for now there is no functional change to the
> memory hotplug code.

Instead of the new arch_add_dev_memory call I'd just add a bool device
argument to arch_add_memory and zone_for_memory (and later the altmap
pointer aswell).

arch_add_memory is a candidate to be factored into common code,
except for s390 everything could be done with two small arch callouts.

Dan Williams Aug. 21, 2015, 3:02 p.m. UTC | #8

[ Adding David Woodhouse ]

On Sat, Aug 15, 2015 at 1:59 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
>> The idea is that this memory is not meant to be available to the page
>> allocator and should not count as new memory capacity.  We're only
>> hotplugging it to get struct page coverage.
>
> This might need a bigger audit of the max_pfn usages.  I remember
> architectures using it as a decisions for using IOMMUs or similar.

We chatted about this at LPC yesterday.  The takeaway was that the
max_pfn checks that the IOMMU code does is for checking whether a
device needs an io-virtual mapping to reach addresses above its DMA
limit (if it can't do 64-bit DMA).  Given the capacities of persistent
memory it's likely that a device with this limitation already can't
address all of RAM let alone PMEM.   So it seems to me that updating
max_pfn for PMEM hotplug does not buy us anything except a few more
opportunities to confuse PMEM as typical RAM.

Jerome Glisse Aug. 21, 2015, 3:15 p.m. UTC | #9

On Fri, Aug 21, 2015 at 08:02:51AM -0700, Dan Williams wrote:
> [ Adding David Woodhouse ]
> 
> On Sat, Aug 15, 2015 at 1:59 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> >> The idea is that this memory is not meant to be available to the page
> >> allocator and should not count as new memory capacity.  We're only
> >> hotplugging it to get struct page coverage.
> >
> > This might need a bigger audit of the max_pfn usages.  I remember
> > architectures using it as a decisions for using IOMMUs or similar.
> 
> We chatted about this at LPC yesterday.  The takeaway was that the
> max_pfn checks that the IOMMU code does is for checking whether a
> device needs an io-virtual mapping to reach addresses above its DMA
> limit (if it can't do 64-bit DMA).  Given the capacities of persistent
> memory it's likely that a device with this limitation already can't
> address all of RAM let alone PMEM.   So it seems to me that updating
> max_pfn for PMEM hotplug does not buy us anything except a few more
> opportunities to confuse PMEM as typical RAM.

I think it is wrong do not update max_pfn for 3 reasons :
  - In some case your PMEM memory will end up below current max_pfn
    value so device doing DMA can confuse your PMEM for regular RAM.
  - Given the above, not updating PMEM means you are not consistant,
    ie on some computer PMEM will be DMA addressable by device and
    on other computer it would not. Because different RAM and PMEM
    configuration, different bios, ... that cause max_pfn to be below
    range where PMEM get hotpluged.
  - Last why would we want to block device to access PMEM directly ?
    Wouldn't it make sense for some device like say network to read
    PMEM directly and stream it over the network ? All this happening
    through IOMMU (i am assuming PCIE network card using IOMMU). Which
    imply having the IOMMU consider this like regular mapping (ignoring
    Will Davis recent patchset here that might affect the IOMMU max_pfn
    test).

Cheers,
Jérôme

[RFC,1/7] x86, mm: ZONE_DEVICE for "device memory"

Commit Message

Comments

Patch