diff mbox series

[2/4] mm/memory-hotplug: allow memory resources to be children

Message ID 20190116181902.670EEBC3@viggo.jf.intel.com (mailing list archive)
State Superseded
Headers show
Series Allow persistent memory to be used like normal RAM | expand

Commit Message

Dave Hansen Jan. 16, 2019, 6:19 p.m. UTC
From: Dave Hansen <dave.hansen@linux.intel.com>

The mm/resource.c code is used to manage the physical address
space.  We can view the current resource configuration in
/proc/iomem.  An example of this is at the bottom of this
description.

The nvdimm subsystem "owns" the physical address resources which
map to persistent memory and has resources inserted for them as
"Persistent Memory".  We want to use this persistent memory, but
as volatile memory, just like RAM.  The best way to do this is
to leave the existing resource in place, but add a "System RAM"
resource underneath it. This clearly communicates the ownership
relationship of this memory.

The request_resource_conflict() API only deals with the
top-level resources.  Replace it with __request_region() which
will search for !IORESOURCE_BUSY areas lower in the resource
tree than the top level.

We also rework the old error message a bit since we do not get
the conflicting entry back: only an indication that we *had* a
conflict.

We *could* also simply truncate the existing top-level
"Persistent Memory" resource and take over the released address
space.  But, this means that if we ever decide to hot-unplug the
"RAM" and give it back, we need to recreate the original setup,
which may mean going back to the BIOS tables.

This should have no real effect on the existing collision
detection because the areas that truly conflict should be marked
IORESOURCE_BUSY.

00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c97ff : Video ROM
000c9800-000ca5ff : Adapter ROM
000f0000-000fffff : Reserved
  000f0000-000fffff : System ROM
00100000-9fffffff : System RAM
  01000000-01e071d0 : Kernel code
  01e071d1-027dfdff : Kernel data
  02dc6000-0305dfff : Kernel bss
a0000000-afffffff : Persistent Memory (legacy)
  a0000000-a7ffffff : System RAM
b0000000-bffdffff : System RAM
bffe0000-bfffffff : Reserved
c0000000-febfffff : PCI Bus 0000:00

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: linux-nvdimm@lists.01.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Huang Ying <ying.huang@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
---

 b/mm/memory_hotplug.c |   31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

Comments

Jerome Glisse Jan. 16, 2019, 7:16 p.m. UTC | #1
On Wed, Jan 16, 2019 at 10:19:02AM -0800, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The mm/resource.c code is used to manage the physical address
> space.  We can view the current resource configuration in
> /proc/iomem.  An example of this is at the bottom of this
> description.
> 
> The nvdimm subsystem "owns" the physical address resources which
> map to persistent memory and has resources inserted for them as
> "Persistent Memory".  We want to use this persistent memory, but
> as volatile memory, just like RAM.  The best way to do this is
> to leave the existing resource in place, but add a "System RAM"
> resource underneath it. This clearly communicates the ownership
> relationship of this memory.
> 
> The request_resource_conflict() API only deals with the
> top-level resources.  Replace it with __request_region() which
> will search for !IORESOURCE_BUSY areas lower in the resource
> tree than the top level.
> 
> We also rework the old error message a bit since we do not get
> the conflicting entry back: only an indication that we *had* a
> conflict.

We should keep the device private check (moving it in __request_region)
as device private can try to register un-use physical address (un-use
at time of device private registration) that latter can block valid
physical address the error message you are removing report such event.


> 
> We *could* also simply truncate the existing top-level
> "Persistent Memory" resource and take over the released address
> space.  But, this means that if we ever decide to hot-unplug the
> "RAM" and give it back, we need to recreate the original setup,
> which may mean going back to the BIOS tables.
> 
> This should have no real effect on the existing collision
> detection because the areas that truly conflict should be marked
> IORESOURCE_BUSY.

Still i am worrying that this might allow device private to register
itself as a child of some un-busy resource as this patch obviously
change the behavior of register_memory_resource()

What about instead explicitly providing parent resource to add_memory()
and then to register_memory_resource() so if it is provided as an
argument (!NULL) then you can __request_region(arg_res, ...) otherwise
you keep existing code intact ?

Cheers,
Jérôme


> 
> 00000000-00000fff : Reserved
> 00001000-0009fbff : System RAM
> 0009fc00-0009ffff : Reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c97ff : Video ROM
> 000c9800-000ca5ff : Adapter ROM
> 000f0000-000fffff : Reserved
>   000f0000-000fffff : System ROM
> 00100000-9fffffff : System RAM
>   01000000-01e071d0 : Kernel code
>   01e071d1-027dfdff : Kernel data
>   02dc6000-0305dfff : Kernel bss
> a0000000-afffffff : Persistent Memory (legacy)
>   a0000000-a7ffffff : System RAM
> b0000000-bffdffff : System RAM
> bffe0000-bfffffff : Reserved
> c0000000-febfffff : PCI Bus 0000:00
> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/mm/memory_hotplug.c |   31 ++++++++++++++-----------------
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child	2018-12-20 11:48:42.317771933 -0800
> +++ b/mm/memory_hotplug.c	2018-12-20 11:48:42.322771933 -0800
> @@ -98,24 +98,21 @@ void mem_hotplug_done(void)
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> -	struct resource *res, *conflict;
> -	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> -	if (!res)
> -		return ERR_PTR(-ENOMEM);
> +	struct resource *res;
> +	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +	char *resource_name = "System RAM";
>  
> -	res->name = "System RAM";
> -	res->start = start;
> -	res->end = start + size - 1;
> -	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	conflict =  request_resource_conflict(&iomem_resource, res);
> -	if (conflict) {
> -		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> -			pr_debug("Device unaddressable memory block "
> -				 "memory hotplug at %#010llx !\n",
> -				 (unsigned long long)start);
> -		}
> -		pr_debug("System RAM resource %pR cannot be added\n", res);
> -		kfree(res);
> +	/*
> +	 * Request ownership of the new memory range.  This might be
> +	 * a child of an existing resource that was present but
> +	 * not marked as busy.
> +	 */
> +	res = __request_region(&iomem_resource, start, size,
> +			       resource_name, flags);
> +
> +	if (!res) {
> +		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
> +				start, start + size);
>  		return ERR_PTR(-EEXIST);
>  	}
>  	return res;
> _
>
Dave Hansen Jan. 16, 2019, 11:01 p.m. UTC | #2
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We also rework the old error message a bit since we do not get
>> the conflicting entry back: only an indication that we *had* a
>> conflict.
> We should keep the device private check (moving it in __request_region)
> as device private can try to register un-use physical address (un-use
> at time of device private registration) that latter can block valid
> physical address the error message you are removing report such event.

If a resource can't support having a child, shouldn't it just be marked
IORESOURCE_BUSY, rather than trying to somehow special-case
IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?
Jerome Glisse Jan. 16, 2019, 11:38 p.m. UTC | #3
On Wed, Jan 16, 2019 at 03:01:39PM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We also rework the old error message a bit since we do not get
> >> the conflicting entry back: only an indication that we *had* a
> >> conflict.
> > We should keep the device private check (moving it in __request_region)
> > as device private can try to register un-use physical address (un-use
> > at time of device private registration) that latter can block valid
> > physical address the error message you are removing report such event.
> 
> If a resource can't support having a child, shouldn't it just be marked
> IORESOURCE_BUSY, rather than trying to somehow special-case
> IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?

So the thing about IORES_DESC_DEVICE_PRIVATE_MEMORY is that they
are not necessarily link to any real resource ie they can just be
random range of physical address that at the time of registration
had no resource.

Now you can latter hotplug some memory that would conflict with
this IORES_DESC_DEVICE_PRIVATE_MEMORY and if that happens we want
to tell that to the user ie:
    "Sorry we registered some fake memory at fake physical address
     and now you have hotplug something that conflict with that."


Why no existing resource ? Well it depends on the platform. In some
case memory for HMM is just not accessible by the CPU _at_ all so
there is obviously no physical address from CPU point of view for
this kind of memory. The other case is PCIE and BAR size. If we
have PCIE bar resizing working everywhere we could potentialy
use the resized PCIE bar (thought i think some device have bug on
that front so i need to check device side too). So when HMM was
design without the PCIE resize and with totaly un-accessible memory
the only option was to pick some unuse physical address range as
anyway memory we are hotpluging is not CPU accessible.

It has been on my TODO to try to find a better way to reserve a
physical range but this is highly platform specific. I need to
investigate if i can report to ACPI on x86 that i want to make
sure the system never assign some physical address range.

Checking PCIE bar resize is also on my TODO (on device side as
i think some device are just buggy there and won't accept BAR
bigger than 256MB and freakout if you try).

So right now i would rather that we keep properly reporting this
hazard so that at least we know it failed because of that. This
also include making sure that we can not register private memory
as a child of an un-busy resource that does exist but might not
have yet been claim by its rightful owner.

Existing code make sure of that, with your change this is a case
that i would not be able to stop. Well i would have to hot unplug
and try a different physical address i guess.

Cheers,
Jérôme
Dave Hansen Jan. 18, 2019, 7:58 p.m. UTC | #4
On 1/16/19 11:16 AM, Jerome Glisse wrote:
>> We *could* also simply truncate the existing top-level
>> "Persistent Memory" resource and take over the released address
>> space.  But, this means that if we ever decide to hot-unplug the
>> "RAM" and give it back, we need to recreate the original setup,
>> which may mean going back to the BIOS tables.
>>
>> This should have no real effect on the existing collision
>> detection because the areas that truly conflict should be marked
>> IORESOURCE_BUSY.
> 
> Still i am worrying that this might allow device private to register
> itself as a child of some un-busy resource as this patch obviously
> change the behavior of register_memory_resource()
> 
> What about instead explicitly providing parent resource to add_memory()
> and then to register_memory_resource() so if it is provided as an
> argument (!NULL) then you can __request_region(arg_res, ...) otherwise
> you keep existing code intact ?

We don't have the locking to do this, do we?  For instance, all the
locking is done below register_memory_resource(), so any previous
resource lookup is invalid by the time we get to register_memory_resource().
Jerome Glisse Jan. 18, 2019, 8:26 p.m. UTC | #5
On Fri, Jan 18, 2019 at 11:58:54AM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We *could* also simply truncate the existing top-level
> >> "Persistent Memory" resource and take over the released address
> >> space.  But, this means that if we ever decide to hot-unplug the
> >> "RAM" and give it back, we need to recreate the original setup,
> >> which may mean going back to the BIOS tables.
> >>
> >> This should have no real effect on the existing collision
> >> detection because the areas that truly conflict should be marked
> >> IORESOURCE_BUSY.
> > 
> > Still i am worrying that this might allow device private to register
> > itself as a child of some un-busy resource as this patch obviously
> > change the behavior of register_memory_resource()
> > 
> > What about instead explicitly providing parent resource to add_memory()
> > and then to register_memory_resource() so if it is provided as an
> > argument (!NULL) then you can __request_region(arg_res, ...) otherwise
> > you keep existing code intact ?
> 
> We don't have the locking to do this, do we?  For instance, all the
> locking is done below register_memory_resource(), so any previous
> resource lookup is invalid by the time we get to register_memory_resource().

Yeah you are right, maybe just a bool then ? bool as_child

Cheers,
Jérôme
Michal Hocko Jan. 23, 2019, 5:05 p.m. UTC | #6
[Sorry for a late reply]

On Wed 16-01-19 10:19:02, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The mm/resource.c code is used to manage the physical address
> space.  We can view the current resource configuration in
> /proc/iomem.  An example of this is at the bottom of this
> description.
> 
> The nvdimm subsystem "owns" the physical address resources which
> map to persistent memory and has resources inserted for them as
> "Persistent Memory".  We want to use this persistent memory, but
> as volatile memory, just like RAM.  The best way to do this is
> to leave the existing resource in place, but add a "System RAM"
> resource underneath it. This clearly communicates the ownership
> relationship of this memory.
> 
> The request_resource_conflict() API only deals with the
> top-level resources.  Replace it with __request_region() which
> will search for !IORESOURCE_BUSY areas lower in the resource
> tree than the top level.
> 
> We also rework the old error message a bit since we do not get
> the conflicting entry back: only an indication that we *had* a
> conflict.
> 
> We *could* also simply truncate the existing top-level
> "Persistent Memory" resource and take over the released address
> space.  But, this means that if we ever decide to hot-unplug the
> "RAM" and give it back, we need to recreate the original setup,
> which may mean going back to the BIOS tables.
> 
> This should have no real effect on the existing collision
> detection because the areas that truly conflict should be marked
> IORESOURCE_BUSY.
> 
> 00000000-00000fff : Reserved
> 00001000-0009fbff : System RAM
> 0009fc00-0009ffff : Reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c97ff : Video ROM
> 000c9800-000ca5ff : Adapter ROM
> 000f0000-000fffff : Reserved
>   000f0000-000fffff : System ROM
> 00100000-9fffffff : System RAM
>   01000000-01e071d0 : Kernel code
>   01e071d1-027dfdff : Kernel data
>   02dc6000-0305dfff : Kernel bss
> a0000000-afffffff : Persistent Memory (legacy)
>   a0000000-a7ffffff : System RAM
> b0000000-bffdffff : System RAM
> bffe0000-bfffffff : Reserved
> c0000000-febfffff : PCI Bus 0000:00

This is the only memory hotplug related change in this series AFAICS.
Unfortunately I am not really familiar with guts for resources
infrastructure so I cannot judge the correctness. The change looks
sensible to me although I do not feel like acking it.

Overall design of this feature makes a lot of sense to me. It doesn't
really add any weird APIs yet it allows to use nvdimms as a memory
transparently. All future policies are to be defined by the userspace
and I like that. I was especially astonished by the sheer size of the
driver and changes it required to achieve that. Really nice!

> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Ross Zwisler <zwisler@kernel.org>
> Cc: Vishal Verma <vishal.l.verma@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Fengguang Wu <fengguang.wu@intel.com>
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
> 
>  b/mm/memory_hotplug.c |   31 ++++++++++++++-----------------
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child	2018-12-20 11:48:42.317771933 -0800
> +++ b/mm/memory_hotplug.c	2018-12-20 11:48:42.322771933 -0800
> @@ -98,24 +98,21 @@ void mem_hotplug_done(void)
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> -	struct resource *res, *conflict;
> -	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> -	if (!res)
> -		return ERR_PTR(-ENOMEM);
> +	struct resource *res;
> +	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> +	char *resource_name = "System RAM";
>  
> -	res->name = "System RAM";
> -	res->start = start;
> -	res->end = start + size - 1;
> -	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	conflict =  request_resource_conflict(&iomem_resource, res);
> -	if (conflict) {
> -		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> -			pr_debug("Device unaddressable memory block "
> -				 "memory hotplug at %#010llx !\n",
> -				 (unsigned long long)start);
> -		}
> -		pr_debug("System RAM resource %pR cannot be added\n", res);
> -		kfree(res);
> +	/*
> +	 * Request ownership of the new memory range.  This might be
> +	 * a child of an existing resource that was present but
> +	 * not marked as busy.
> +	 */
> +	res = __request_region(&iomem_resource, start, size,
> +			       resource_name, flags);
> +
> +	if (!res) {
> +		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
> +				start, start + size);
>  		return ERR_PTR(-EEXIST);
>  	}
>  	return res;
> _
Dave Hansen Jan. 23, 2019, 8:03 p.m. UTC | #7
On 1/16/19 3:38 PM, Jerome Glisse wrote:
> So right now i would rather that we keep properly reporting this
> hazard so that at least we know it failed because of that. This
> also include making sure that we can not register private memory
> as a child of an un-busy resource that does exist but might not
> have yet been claim by its rightful owner.

I can definitely keep the warning in.  But, I don't think there's a
chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
the child of another.  The region_intersects() check *should* find that:

>         for (; addr > size && addr >= iomem_resource.start; addr -= size) {
>                 ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
>                 if (ret != REGION_DISJOINT)
>                         continue;
Jerome Glisse Jan. 23, 2019, 8:15 p.m. UTC | #8
On Wed, Jan 23, 2019 at 12:03:54PM -0800, Dave Hansen wrote:
> On 1/16/19 3:38 PM, Jerome Glisse wrote:
> > So right now i would rather that we keep properly reporting this
> > hazard so that at least we know it failed because of that. This
> > also include making sure that we can not register private memory
> > as a child of an un-busy resource that does exist but might not
> > have yet been claim by its rightful owner.
> 
> I can definitely keep the warning in.  But, I don't think there's a
> chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
> the child of another.  The region_intersects() check *should* find that:

Sounds fine to (just keep the warning).

Cheers,
Jérôme

> 
> >         for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> >                 ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> >                 if (ret != REGION_DISJOINT)
> >                         continue;
>
diff mbox series

Patch

diff -puN mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child	2018-12-20 11:48:42.317771933 -0800
+++ b/mm/memory_hotplug.c	2018-12-20 11:48:42.322771933 -0800
@@ -98,24 +98,21 @@  void mem_hotplug_done(void)
 /* add this memory to iomem resource */
 static struct resource *register_memory_resource(u64 start, u64 size)
 {
-	struct resource *res, *conflict;
-	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
-	if (!res)
-		return ERR_PTR(-ENOMEM);
+	struct resource *res;
+	unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+	char *resource_name = "System RAM";
 
-	res->name = "System RAM";
-	res->start = start;
-	res->end = start + size - 1;
-	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	conflict =  request_resource_conflict(&iomem_resource, res);
-	if (conflict) {
-		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
-			pr_debug("Device unaddressable memory block "
-				 "memory hotplug at %#010llx !\n",
-				 (unsigned long long)start);
-		}
-		pr_debug("System RAM resource %pR cannot be added\n", res);
-		kfree(res);
+	/*
+	 * Request ownership of the new memory range.  This might be
+	 * a child of an existing resource that was present but
+	 * not marked as busy.
+	 */
+	res = __request_region(&iomem_resource, start, size,
+			       resource_name, flags);
+
+	if (!res) {
+		pr_debug("Unable to reserve System RAM region: %016llx->%016llx\n",
+				start, start + size);
 		return ERR_PTR(-EEXIST);
 	}
 	return res;